Skip to main content

GuardBinding

A GuardBinding attaches a prompt-injection / jailbreak classifier to an agent's LLM traffic. When an agent references a guard, the gateway classifies incoming user prompts, assistant responses, and/or tool results before they reach the LLM — blocking or logging anything that looks like an injection attempt.

Guards are enforced in the gateway, reused across agents, and hot-reload when edited.

pai apply -f guard.yaml
pai get guards
pai describe guard <name>
pai delete guard <name>

How it works

  1. Each GuardBinding points at a classifier HTTP endpoint. The classifier is pluggable — anything that accepts {text} and returns {label, score, labels} works. Pai ships two references: a stub backend for smoke tests and a promptGuard image running Meta Prompt-Guard-86M. You can also point classifier.endpoint at any external service (Llama-Guard, NeMo Guardrails, Lakera, Protect AI, a home-grown classifier) via type: custom.
  2. An agent opts in by listing guards in spec.guards[] and selecting what to scan (prompts, responses, tool results from specific tools).
  3. Pai materializes the resolved config into a shared store that the gateway polls. Policy changes propagate within ~1 minute — no restart required.
  4. On every LLM call, the gateway walks the message list. Any scan target is classified; on a violation, enforce mode returns HTTP 403, audit mode logs the event and forwards.
  5. Classifier timeouts or connection errors fail open — the request is forwarded and a guard.unavailable audit event is emitted.

Field reference

FieldRequiredDescription
classifier.typeYespromptGuard (default), llamaGuard, or custom
classifier.endpointYesHTTP(S) URL the gateway POSTs to
classifier.modelNoModel identifier (default: the backend's canonical model)
classifier.authNoOptional secret-backed auth for reaching the classifier endpoint. Same shape as Provider.auth
thresholds.injectionNoScore threshold for injection label (default: 0.9)
thresholds.jailbreakNoScore threshold for jailbreak label (default: 0.9)
enforcementNoaudit (default) logs violations but forwards · enforce returns HTTP 403
timeoutMsNoPer-call classifier timeout in ms (default: 500). Timeouts fail open.
audit.savePayloadNoStore a sanitized copy of flagged content in guard.violation_* events (default: true). Set false to never persist payload content
audit.maxPayloadCharsNoMaximum characters from the sanitized payload per event (default: 2048)

A request is flagged if injection >= thresholds.injection or jailbreak >= thresholds.jailbreak.

Classifier types

The classifier.type field is a hint about backend semantics — the real integration point is the HTTP endpoint. Anything that accepts POST /classify {text} and returns the expected response shape is a valid classifier.

typeShipped backend optionsNotes
stubDeterministic substring matcherSelect at runtime with guard.type: stub (Helm values) or PAI_GUARD_TYPE=stub (env). Same pai-guard:latest image — no separate build. Safe for local dev, CI, and demos — no model load, no GPU, no Hugging Face dependency
promptGuardProtectAI DeBERTa-v3 (default) or Meta Prompt-Guard-86MSmall transformer, CPU-friendly. ProtectAI is ungated on Hugging Face and baked into the default pai-guard:latest image at build time. Meta's Prompt-Guard-86M is an alternate — high quality but license-gated with manual review; rebuild with --build-arg PAI_GUARD_MODEL=meta-llama/Prompt-Guard-86M --build-arg HF_TOKEN=... to use it instead
llamaGuardLlama-Guard 3Larger policy classifier for content safety + injection. Run your own service or use any hosted Llama-Guard endpoint
customAny HTTP serviceCovers everything else — Lakera, Protect AI hosted, NeMo Guardrails, in-house classifiers, proprietary models. Must return {label, score, labels:{benign,injection,jailbreak}}

Switching backends is a configuration change, not a code change. Change classifier.endpoint and classifier.type, apply, done — no gateway restart required (hot-reloaded). Switching the bundled pai-guard service between stub and promptGuard is controlled by guard.type in the Helm values — same image, different runtime behavior.

Example — bundled promptGuard (default)

The default pai-guard image bakes protectai/deberta-v3-base-prompt-injection-v2 at build time. Point classifier.endpoint at the bundled in-platform service and you're done — no token, no runtime downloads.

apiVersion: pai.io/v1
kind: GuardBinding
metadata:
name: prompt-guard-default
spec:
classifier:
type: promptGuard
endpoint: http://pai-guard.pai-system.svc.cluster.local/classify
model: protectai/deberta-v3-base-prompt-injection-v2
thresholds:
injection: 0.9
jailbreak: 0.9
enforcement: audit
timeoutMs: 2000

Example — external classifier

apiVersion: pai.io/v1
kind: GuardBinding
metadata:
name: prompt-guard-external
spec:
classifier:
type: custom
endpoint: https://api.example-guard.com/v1/classify
auth:
secretRef: prompt-guard-external-token
secretKey: token
header: Authorization
prefix: "Bearer "
thresholds:
injection: 0.85
jailbreak: 0.85
enforcement: enforce
timeoutMs: 1000

Secret required:

pai add secret prompt-guard-external-token --from-literal token=YOUR_API_KEY

Attaching guards to an agent

See the prompt injection guide for the full rollout workflow. The short version:

spec:
guards:
- binding: prompt-guard-default
scan:
prompts: true
responses: false
toolResults:
tools: [web_fetch, web_search]
enforcement: audit

Scan targets

FieldScans
scan.promptsEvery user-role message in the request
scan.responsesAssistant output. Streaming responses are audit-only (the stream has already been delivered; enforcement would require buffering)
scan.toolResults.toolsTool result content blocks whose originating tool name matches. ["*"] = all tools. Omit to disable.

Tool-result scanning is the most valuable setting for indirect injection: it catches malicious content pulled from untrusted sources (web pages, GitHub issues, emails) before it reaches the model.

Enforcement override — tighten only

spec.guards[].enforcement may only tighten the binding's default:

  • auditenforce: allowed
  • enforceaudit: rejected at reconcile time (status.ready: false)

This prevents agent authors from silently weakening a policy set by the platform operator.

Replace semantics

When an Agent sets its own spec.guards[], it fully replaces any guards declared on the referenced Agent — matching how models overrides work. There is no merging by binding name.

Status

FieldDescription
status.readyTrue when the GuardBinding spec passes validation
status.messagevalidated on success, or a human-readable error
status.observedGenerationThe metadata.generation this status was computed for

Metrics

The gateway exposes per-guard Prometheus metrics:

  • pai_guard_checks_total{workload, scanner, label, action} — classification counts. label is one of benign | injection | jailbreak | unavailable; action is forward | audit | enforce | fail_open.
  • pai_guard_latency_seconds{scanner} — histogram of classifier round-trip times.

Audit events

Emitted to the gateway's tamper-evident audit log:

EventWhen
guard.violation_enforceRequest blocked with HTTP 403
guard.violation_auditViolation detected but forwarded (audit mode, or streaming response)
guard.unavailableClassifier timeout or connection error — request was forwarded

Flagged payload capture

By default, both guard.violation_enforce and guard.violation_audit events include a sanitized copy of the flagged content under a payload field. The gateway runs a redaction pass before writing to the audit chain, replacing:

  • Emails and phone numbers
  • API keys (Anthropic / OpenAI sk- / sk-ant-, GitHub ghp_ / github_pat_, Google AIza, Slack xox*)
  • AWS access key IDs (AKIA..., ASIA...)
  • Credit card numbers (Luhn-checked)
  • JWTs and PEM-encoded private keys
  • Long hex (≥32 chars) and base64 (≥40 chars) blobs

Each match becomes a tag like [REDACTED:email] so the record still tells you what kind of sensitive data was nearby. Content is truncated to audit.maxPayloadChars (default 2048) with a [TRUNCATED:<original-length>] marker.

Benign traffic and guard.unavailable events never carry a payload. If you need to disable capture entirely, set audit.savePayload: false — the audit record still carries binding, label, score, where, and tool.

Regex-based redaction is best-effort. It catches the most common secret shapes but is not a substitute for not logging sensitive contexts. That's why it's tied to the violation path only (already-suspicious traffic), not to every prompt.