# GuardBinding
A `GuardBinding` attaches a prompt-injection / jailbreak classifier to an agent's LLM traffic. When an agent references a guard, the gateway classifies user prompts, tool results, and assistant responses as they pass through — blocking or logging anything that looks like an injection attempt.
Guards are enforced in the gateway, reused across agents, and hot-reload when edited.
```bash
pai apply -f guard.yaml
pai get guards
pai describe guard <name>
pai delete guard <name>
```
## How it works
- Each `GuardBinding` points at a classifier HTTP endpoint. The classifier is pluggable — anything that accepts `{text}` and returns `{label, score, labels}` works. Pai ships two references: a `stub` backend for smoke tests and a `promptGuard` image running Meta Prompt-Guard-86M. You can also point `classifier.endpoint` at any external service (Llama-Guard, NeMo Guardrails, Lakera, Protect AI, a home-grown classifier) via `type: custom`.
- An agent opts in by listing guards in `spec.guards[]` and selecting what to scan (prompts, responses, tool results from specific tools).
- Pai materializes the resolved config into a shared store that the gateway polls. Policy changes propagate within ~1 minute — no restart required.
- On every LLM call, the gateway walks the message list. Any scan target is classified; on a violation, `enforce` mode returns HTTP 403, `audit` mode logs the event and forwards.
- Classifier timeouts or connection errors fail open — the request is forwarded and a `guard.unavailable` audit event is emitted.
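The classifier contract above can be sketched as a pure function, a hypothetical stand-in for the `stub` backend's substring matching (the marker strings here are illustrative, not the shipped list):

```python
# Hypothetical markers for illustration only; the shipped stub's list may differ.
INJECTION_MARKERS = ["ignore previous instructions", "disregard the system prompt"]
JAILBREAK_MARKERS = ["you are dan", "pretend you have no rules"]

def classify(text: str) -> dict:
    """Return the {label, score, labels} shape the gateway expects."""
    t = text.lower()
    labels = {
        "injection": 1.0 if any(m in t for m in INJECTION_MARKERS) else 0.0,
        "jailbreak": 1.0 if any(m in t for m in JAILBREAK_MARKERS) else 0.0,
    }
    labels["benign"] = 1.0 - max(labels.values())
    top = max(labels, key=labels.get)
    return {"label": top, "score": labels[top], "labels": labels}
```

Exposing this behind `POST /classify` with any HTTP framework would satisfy the gateway's integration point.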
## Field reference
| Field | Required | Description |
|---|---|---|
| `classifier.type` | Yes | `promptGuard` (default), `llamaGuard`, or `custom` |
| `classifier.endpoint` | Yes | HTTP(S) URL the gateway POSTs to |
| `classifier.model` | No | Model identifier (default: the backend's canonical model) |
| `classifier.auth` | No | Optional secret-backed auth for reaching the classifier endpoint. Same shape as `Provider.auth` |
| `thresholds.injection` | No | Score threshold for the injection label (default: 0.9) |
| `thresholds.jailbreak` | No | Score threshold for the jailbreak label (default: 0.9) |
| `enforcement` | No | `audit` (default) logs violations but forwards · `enforce` returns HTTP 403 |
| `timeoutMs` | No | Per-call classifier timeout in ms (default: 500). Timeouts fail open. |
| `audit.savePayload` | No | Store a sanitized copy of flagged content in `guard.violation_*` events (default: true). Set false to never persist payload content |
| `audit.maxPayloadChars` | No | Maximum characters from the sanitized payload per event (default: 2048) |
A request is flagged if `injection >= thresholds.injection` or `jailbreak >= thresholds.jailbreak`.
## Classifier types
The `classifier.type` field is a hint about backend semantics — the real integration point is the HTTP endpoint. Anything that accepts `POST /classify {text}` and returns the expected response shape is a valid classifier.
| `type` | Shipped backend options | Notes |
|---|---|---|
| `stub` | Deterministic substring matcher | Select at runtime with `guard.type: stub` (Helm values) or `PAI_GUARD_TYPE=stub` (env). Same `pai-guard:latest` image — no separate build. Safe for local dev, CI, and demos — no model load, no GPU, no Hugging Face dependency |
| `promptGuard` | ProtectAI DeBERTa-v3 (default) or Meta Prompt-Guard-86M | Small transformer, CPU-friendly. ProtectAI is ungated on Hugging Face and baked into the default `pai-guard:latest` image at build time. Meta's Prompt-Guard-86M is an alternate — high quality but license-gated with manual review; rebuild with `--build-arg PAI_GUARD_MODEL=meta-llama/Prompt-Guard-86M --build-arg HF_TOKEN=...` to use it instead |
| `llamaGuard` | Llama-Guard 3 | Larger policy classifier for content safety + injection. Run your own service or use any hosted Llama-Guard endpoint |
| `custom` | Any HTTP service | Covers everything else — Lakera, Protect AI hosted, NeMo Guardrails, in-house classifiers, proprietary models. Must return `{label, score, labels: {benign, injection, jailbreak}}` |
Switching backends is a configuration change, not a code change. Change `classifier.endpoint` and `classifier.type`, apply, done — no gateway restart required (hot-reloaded). Switching the bundled `pai-guard` service between `stub` and `promptGuard` is controlled by `guard.type` in the Helm values — same image, different runtime behavior.
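Since the endpoint is the real contract, a minimal shape check for a classifier reply might look like the following (a hypothetical sketch, not the gateway's actual validation code):

```python
def is_valid_reply(body: dict) -> bool:
    """True if a classifier response matches the expected {label, score, labels} shape."""
    labels = body.get("labels")
    return (
        isinstance(body.get("label"), str)
        and isinstance(body.get("score"), (int, float))
        and isinstance(labels, dict)
        and {"benign", "injection", "jailbreak"} <= set(labels)
    )
```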
## Example — bundled promptGuard (default)
The default `pai-guard` image bakes `protectai/deberta-v3-base-prompt-injection-v2` at build time. Point `classifier.endpoint` at the bundled in-platform service and you're done — no token, no runtime downloads.
```yaml
apiVersion: pai.io/v1
kind: GuardBinding
metadata:
  name: prompt-guard-default
spec:
  classifier:
    type: promptGuard
    endpoint: http://pai-guard.pai-system.svc.cluster.local/classify
    model: protectai/deberta-v3-base-prompt-injection-v2
  thresholds:
    injection: 0.9
    jailbreak: 0.9
  enforcement: audit
  timeoutMs: 2000
```
## Example — external classifier
```yaml
apiVersion: pai.io/v1
kind: GuardBinding
metadata:
  name: prompt-guard-external
spec:
  classifier:
    type: custom
    endpoint: https://api.example-guard.com/v1/classify
    auth:
      secretRef: prompt-guard-external-token
      secretKey: token
      header: Authorization
      prefix: "Bearer "
  thresholds:
    injection: 0.85
    jailbreak: 0.85
  enforcement: enforce
  timeoutMs: 1000
```
Secret required:

```bash
pai add secret prompt-guard-external-token --from-literal token=YOUR_API_KEY
```
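For illustration, the `classifier.auth` fields compose into a request header roughly like this (a sketch using the field names from the example above; the gateway's actual code may differ, and the `Authorization` fallback is an assumption):

```python
def auth_header(auth: dict, secret_value: str) -> dict:
    """Build the HTTP header attached when calling the classifier endpoint.
    Assumes header defaults to Authorization and prefix to empty (hypothetical)."""
    return {auth.get("header", "Authorization"): auth.get("prefix", "") + secret_value}
```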
## Attaching guards to an agent
See the prompt injection guide for the full rollout workflow. The short version:
```yaml
spec:
  guards:
    - binding: prompt-guard-default
      scan:
        prompts: true
        responses: false
        toolResults:
          tools: [web_fetch, web_search]
      enforcement: audit
```
## Scan targets
| Field | Scans |
|---|---|
| `scan.prompts` | Every user-role message in the request |
| `scan.responses` | Assistant output. Streaming responses are audit-only (the stream has already been delivered; enforcement would require buffering) |
| `scan.toolResults.tools` | Tool result content blocks whose originating tool name matches. `["*"]` = all tools. Omit to disable. |
Tool-result scanning is the most valuable setting for indirect injection: it catches malicious content pulled from untrusted sources (web pages, GitHub issues, emails) before it reaches the model.
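A sketch of how a scan config selects message content, assuming an OpenAI-style message list (the gateway's internal message shape may differ):

```python
def scan_targets(messages: list, scan: dict):
    """Yield (where, text) pairs a guard would classify under this scan config."""
    tool_list = scan.get("toolResults", {}).get("tools", [])
    for msg in messages:
        role, content = msg.get("role"), msg.get("content", "")
        if role == "user" and scan.get("prompts"):
            yield ("prompt", content)
        elif role == "assistant" and scan.get("responses"):
            yield ("response", content)
        elif role == "tool" and ("*" in tool_list or msg.get("name") in tool_list):
            yield ("tool_result", content)
```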
## Enforcement override — tighten only
`spec.guards[].enforcement` may only tighten the binding's default:

- `audit` → `enforce`: allowed
- `enforce` → `audit`: rejected at reconcile time (`status.ready: false`)
This prevents agent authors from silently weakening a policy set by the platform operator.
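The tighten-only rule amounts to an ordering on enforcement levels; a minimal sketch of the reconcile-time check:

```python
LEVELS = {"audit": 0, "enforce": 1}  # stricter levels rank higher

def override_valid(binding_default: str, agent_override: str) -> bool:
    """An agent-level override is valid only if at least as strict as the binding's default."""
    return LEVELS[agent_override] >= LEVELS[binding_default]
```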
## Replace semantics
When an Agent sets its own `spec.guards[]`, it fully replaces any guards declared on the referenced Agent — matching how `models` overrides work. There is no merging by binding name.
## Status
| Field | Description |
|---|---|
| `status.ready` | True when the GuardBinding spec passes validation |
| `status.message` | `validated` on success, or a human-readable error |
| `status.observedGeneration` | The `metadata.generation` this status was computed for |
## Metrics
The gateway exposes per-guard Prometheus metrics:
- `pai_guard_checks_total{workload, scanner, label, action}` — classification counts. `label` is one of `benign | injection | jailbreak | unavailable`; `action` is `forward | audit | enforce | fail_open`.
- `pai_guard_latency_seconds{scanner}` — histogram of classifier round-trip times.
## Audit events
Emitted to the gateway's tamper-evident audit log:
| Event | When |
|---|---|
| `guard.violation_enforce` | Request blocked with HTTP 403 |
| `guard.violation_audit` | Violation detected but forwarded (audit mode, or streaming response) |
| `guard.unavailable` | Classifier timeout or connection error — request was forwarded |
## Flagged payload capture
By default, both `guard.violation_enforce` and `guard.violation_audit` events include a sanitized copy of the flagged content under a `payload` field. The gateway runs a redaction pass before writing to the audit chain, replacing:
- Emails and phone numbers
- API keys (Anthropic / OpenAI `sk-` / `sk-ant-`, GitHub `ghp_` / `github_pat_`, Google `AIza`, Slack `xox*`)
- AWS access key IDs (`AKIA...`, `ASIA...`)
- Credit card numbers (Luhn-checked)
- JWTs and PEM-encoded private keys
- Long hex (≥32 chars) and base64 (≥40 chars) blobs
Each match becomes a tag like `[REDACTED:email]` so the record still tells you what kind of sensitive data was nearby. Content is truncated to `audit.maxPayloadChars` (default 2048) with a `[TRUNCATED:<original-length>]` marker.

Benign traffic and `guard.unavailable` events never carry a payload. If you need to disable capture entirely, set `audit.savePayload: false` — the audit record still carries `binding`, `label`, `score`, `where`, and `tool`.
Regex-based redaction is best-effort. It catches the most common secret shapes but is not a substitute for not logging sensitive contexts. That's why it's tied to the violation path only (already-suspicious traffic), not to every prompt.
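To make the behavior concrete, here is a hypothetical redaction pass covering two of the listed shapes (emails and AWS access key IDs) plus the truncation marker; the gateway's actual patterns are broader than this sketch:

```python
import re

# Illustrative patterns only — the shipped redaction list covers many more shapes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aws_key": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
}

def sanitize(text: str, max_chars: int = 2048) -> str:
    """Replace matches with [REDACTED:<kind>] tags, then truncate with a length marker."""
    for kind, pat in PATTERNS.items():
        text = pat.sub(f"[REDACTED:{kind}]", text)
    if len(text) > max_chars:
        text = text[:max_chars] + f"[TRUNCATED:{len(text)}]"
    return text
```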