GuardBinding

A GuardBinding attaches a prompt-injection / jailbreak classifier to an agent's LLM traffic. When an agent references a guard, the gateway classifies incoming user prompts, assistant responses, and/or tool results before they reach the LLM — blocking or logging anything that looks like an injection attempt.

Guards are enforced in the gateway, reused across agents, and hot-reload when edited.

pai apply -f guard.yaml
pai get guards
pai describe guard <name>
pai delete guard <name>

How it works

Each GuardBinding points at a classifier HTTP endpoint. The classifier is pluggable — anything that accepts {text} and returns {label, score, labels} works. Pai ships two references: a stub backend for smoke tests and a promptGuard image running Meta Prompt-Guard-86M. You can also point classifier.endpoint at any external service (Llama-Guard, NeMo Guardrails, Lakera, Protect AI, a home-grown classifier) via type: custom.
An agent opts in by listing guards in spec.guards[] and selecting what to scan (prompts, responses, tool results from specific tools).
Pai materializes the resolved config into a shared store that the gateway polls. Policy changes propagate within ~1 minute — no restart required.
On every LLM call, the gateway walks the message list. Any scan target is classified; on a violation, enforce mode returns HTTP 403, audit mode logs the event and forwards.
Classifier timeouts or connection errors fail open — the request is forwarded and a guard.unavailable audit event is emitted.

Field reference

Field	Required	Description
`classifier.type`	Yes	`promptGuard` (default), `llamaGuard`, or `custom`
`classifier.endpoint`	Yes	HTTP(S) URL the gateway POSTs to
`classifier.model`	No	Model identifier (default: the backend's canonical model)
`classifier.auth`	No	Optional secret-backed auth for reaching the classifier endpoint. Same shape as `Provider.auth`
`thresholds.injection`	No	Score threshold for `injection` label (default: 0.9)
`thresholds.jailbreak`	No	Score threshold for `jailbreak` label (default: 0.9)
`enforcement`	No	`audit` (default) logs violations but forwards · `enforce` returns HTTP 403
`timeoutMs`	No	Per-call classifier timeout in ms (default: 500). Timeouts fail open.
`audit.savePayload`	No	Store a sanitized copy of flagged content in `guard.violation_*` events (default: `true`). Set `false` to never persist payload content
`audit.maxPayloadChars`	No	Maximum characters from the sanitized payload per event (default: 2048)

A request is flagged if injection >= thresholds.injection or jailbreak >= thresholds.jailbreak.

Classifier types

The classifier.type field is a hint about backend semantics — the real integration point is the HTTP endpoint. Anything that accepts POST /classify {text} and returns the expected response shape is a valid classifier.

`type`	Shipped backend options	Notes
`stub`	Deterministic substring matcher	Select at runtime with `guard.type: stub` (Helm values) or `PAI_GUARD_TYPE=stub` (env). Same `pai-guard:latest` image — no separate build. Safe for local dev, CI, and demos — no model load, no GPU, no Hugging Face dependency
`promptGuard`	ProtectAI DeBERTa-v3 (default) or Meta Prompt-Guard-86M	Small transformer, CPU-friendly. ProtectAI is ungated on Hugging Face and baked into the default `pai-guard:latest` image at build time. Meta's Prompt-Guard-86M is an alternate — high quality but license-gated with manual review; rebuild with `--build-arg PAI_GUARD_MODEL=meta-llama/Prompt-Guard-86M --build-arg HF_TOKEN=...` to use it instead
`llamaGuard`	Llama-Guard 3	Larger policy classifier for content safety + injection. Run your own service or use any hosted Llama-Guard endpoint
`custom`	Any HTTP service	Covers everything else — Lakera, Protect AI hosted, NeMo Guardrails, in-house classifiers, proprietary models. Must return `{label, score, labels:{benign,injection,jailbreak}}`

Switching backends is a configuration change, not a code change. Change classifier.endpoint and classifier.type, apply, done — no gateway restart required (hot-reloaded). Switching the bundled pai-guard service between stub and promptGuard is controlled by guard.type in the Helm values — same image, different runtime behavior.

Example — bundled promptGuard (default)

The default pai-guard image bakes protectai/deberta-v3-base-prompt-injection-v2 at build time. Point classifier.endpoint at the bundled in-platform service and you're done — no token, no runtime downloads.

apiVersion: pai.io/v1
kind: GuardBinding
metadata:
  name: prompt-guard-default
spec:
  classifier:
    type: promptGuard
    endpoint: http://pai-guard.pai-system.svc.cluster.local/classify
    model: protectai/deberta-v3-base-prompt-injection-v2
  thresholds:
    injection: 0.9
    jailbreak: 0.9
  enforcement: audit
  timeoutMs: 2000

Example — external classifier

apiVersion: pai.io/v1
kind: GuardBinding
metadata:
  name: prompt-guard-external
spec:
  classifier:
    type: custom
    endpoint: https://api.example-guard.com/v1/classify
    auth:
      secretRef: prompt-guard-external-token
      secretKey: token
      header: Authorization
      prefix: "Bearer "
  thresholds:
    injection: 0.85
    jailbreak: 0.85
  enforcement: enforce
  timeoutMs: 1000

Secret required:

pai add secret prompt-guard-external-token --from-literal token=YOUR_API_KEY

Attaching guards to an agent

See the prompt injection guide for the full rollout workflow. The short version:

spec:
  guards:
    - binding: prompt-guard-default
      scan:
        prompts: true
        responses: false
        toolResults:
          tools: [web_fetch, web_search]
      enforcement: audit

Scan targets

Field	Scans
`scan.prompts`	Every user-role message in the request
`scan.responses`	Assistant output. Streaming responses are audit-only (the stream has already been delivered; enforcement would require buffering)
`scan.toolResults.tools`	Tool result content blocks whose originating tool name matches. `["*"]` = all tools. Omit to disable.

Tool-result scanning is the most valuable setting for indirect injection: it catches malicious content pulled from untrusted sources (web pages, GitHub issues, emails) before it reaches the model.

Enforcement override — tighten only

spec.guards[].enforcement may only tighten the binding's default:

audit → enforce: allowed
enforce → audit: rejected at reconcile time (status.ready: false)

This prevents agent authors from silently weakening a policy set by the platform operator.

Replace semantics

When an Agent sets its own spec.guards[], it fully replaces any guards declared on the referenced Agent — matching how models overrides work. There is no merging by binding name.

Status

Field	Description
`status.ready`	True when the GuardBinding spec passes validation
`status.message`	`validated` on success, or a human-readable error
`status.observedGeneration`	The `metadata.generation` this status was computed for

Metrics

The gateway exposes per-guard Prometheus metrics:

pai_guard_checks_total{workload, scanner, label, action} — classification counts. label is one of benign | injection | jailbreak | unavailable; action is forward | audit | enforce | fail_open.
pai_guard_latency_seconds{scanner} — histogram of classifier round-trip times.

Audit events

Emitted to the gateway's tamper-evident audit log:

Event	When
`guard.violation_enforce`	Request blocked with HTTP 403
`guard.violation_audit`	Violation detected but forwarded (audit mode, or streaming response)
`guard.unavailable`	Classifier timeout or connection error — request was forwarded

Flagged payload capture

By default, both guard.violation_enforce and guard.violation_audit events include a sanitized copy of the flagged content under a payload field. The gateway runs a redaction pass before writing to the audit chain, replacing:

Emails and phone numbers
API keys (Anthropic / OpenAI sk- / sk-ant-, GitHub ghp_ / github_pat_, Google AIza, Slack xox*)
AWS access key IDs (AKIA..., ASIA...)
Credit card numbers (Luhn-checked)
JWTs and PEM-encoded private keys
Long hex (≥32 chars) and base64 (≥40 chars) blobs

Each match becomes a tag like [REDACTED:email] so the record still tells you what kind of sensitive data was nearby. Content is truncated to audit.maxPayloadChars (default 2048) with a [TRUNCATED:<original-length>] marker.

Benign traffic and guard.unavailable events never carry a payload. If you need to disable capture entirely, set audit.savePayload: false — the audit record still carries binding, label, score, where, and tool.

Regex-based redaction is best-effort. It catches the most common secret shapes but is not a substitute for not logging sensitive contexts. That's why it's tied to the violation path only (already-suspicious traffic), not to every prompt.

How it works​

Field reference​

Classifier types​

Example — bundled promptGuard (default)​

Example — external classifier​

Attaching guards to an agent​

Scan targets​

Enforcement override — tighten only​

Replace semantics​

Status​

Metrics​

Audit events​

Flagged payload capture​