Eval
An Eval is a quality test for an Agent. It runs a curated set of test cases (EvalFixtures) against an Agent template, scores the results — both deterministically and via an optional LLM judge — and gates risky changes behind those scores.
The data model has three pieces:
| CRD | Short | What it represents |
|---|---|---|
| Eval | ev | The runner config: which Agent, which fixtures, which judge, pass threshold, optional cron |
| EvalFixture | efx | One specific test case (one prompt + assertions) |
| JudgeCalibration | jcal | Human-labeled examples that validate a judge model's agreement before it gates anything |
Together they let you turn real production traffic into a permanent regression suite, then refuse to ship agent changes that would regress it.
Why this is here
Agents are non-deterministic. A change to the system prompt that fixes one case can silently regress five others, and "manual testing" doesn't catch it. Pai's eval surface gives you the same discipline you have for normal code: an auto-running suite that grows organically from production traces, scores results with a calibrated judge, and blocks edits that drop the score.
The flywheel:
production traffic → trace → promote → EvalFixture
        │
        ▼
EvalFixture is auto-included in the per-agent Eval (label match)
        │
        ▼
Eval runs on edit / on cron / on demand
        │
        ▼
On regression: edit blocked, fixtures listed inline so you see what broke
Quickstart
Apply a chat-only template, a fixture, and an Eval. The fixture's default labels match the Eval's selector, so the fixture is auto-included.
---
apiVersion: pai.io/v1
kind: Agent
metadata:
name: math-tutor
spec:
models: [anthropic/claude-haiku-4-5]
system: |
You are a math tutor. Answer in one short sentence.
Always include the literal text "answer:" before the result.
tools:
- { type: bash, enabled: false }
- { type: send_email, enabled: false }
---
apiVersion: pai.io/v1
kind: EvalFixture
metadata:
name: tutor-add
labels:
agent: math-tutor
spec:
targetAgent: math-tutor
kind: golden
severity: medium
description: Simple addition; tutor must prefix the result with "answer:".
input:
type: prompt
payload: { prompt: "What is 2 + 2?" }
assertions:
- { type: contains, value: "answer:" }
- { type: contains, value: "4" }
- { type: toolNotCalled, tool: "bash" }
- type: judge
kind: rubric
rubric: |
Did the agent answer correctly AND prefix the result with the
literal text "answer:" (lowercase)? Penalize verbosity.
minScore: 0.8
weight: 2.0
---
apiVersion: pai.io/v1
kind: Eval
metadata:
name: math-tutor-eval
spec:
targetAgent: math-tutor
fixtureSelector:
matchLabels: { agent: math-tutor }
judge:
model: anthropic/claude-haiku-4-5
temperature: 0
scoring:
passThreshold: 1.0
execution:
parallel: 3
timeoutSeconds: 120
Apply and run:
pai apply -f tutor.yaml
# Trigger the run from the web UI's "Run now" button, or via the API:
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.pairun.dev/evals/math-tutor-eval/run
# Watch progress
pai get eval math-tutor-eval
Each run writes a Pass or Fail verdict to the Eval's status with per-fixture scores; a judge verdict (when applicable) shows up under each fixture in the UI with the rubric reasoning, violations, and a "what would raise the score" suggestion.
CLI
pai apply -f eval.yaml # Create or update via YAML (multi-doc OK)
pai get evals # List Evals in your namespace
pai get eval <name> # Show one Eval's status
pai get eval-fixtures # List fixtures (kubectl alias: efx)
pai delete eval <name> # Delete an Eval (fixtures are not removed)
pai delete eval-fixture <name> # Delete a fixture
The web UI exposes the same operations plus a few flows that don't have CLI equivalents yet:
- Promote a trace to a fixture (Agent detail → Trace tab → Promote to fixture)
- New eval form that picks fixtures by label or explicit list
- Eval gate on Save when editing a template
EvalFixture
A fixture captures one test case: one input, one set of assertions, one expected outcome.
apiVersion: pai.io/v1
kind: EvalFixture
metadata:
name: refund-must-clarify
labels:
agent: support-bot
scenario: refund
spec:
targetAgent: support-bot
kind: bad # golden | bad | edge
severity: high # low | medium | high | critical
description: Original Jane issued a 100% refund without asking why.
origin: # populated by Promote-from-trace
traceId: conv-support-cd7e649e
promotedAt: "2026-05-04T14:22:01Z"
promotedBy: amir@company.com
input:
type: webhook # webhook | prompt | telegram | slack | email
payload:
ticket: "I want a refund for order #12345"
assertions:
- { type: toolCalled, tool: clarify_reason, before: [create_jira_ticket] }
- { type: toolNotCalled, tool: send_email, after: [clarify_reason] }
- type: judge
kind: rubric
rubric: |
Did the agent ask the customer for the reason before creating any
ticket or drafting a refund email? Score 1 if yes, 0 if not.
minScore: 0.8
Fixture kind
spec.kind is a label that documents the fixture's intent. The runner doesn't treat the three kinds differently — they're informational, but the UI and promotion wizard use them.
| Kind | When to use |
|---|---|
| golden | Agent did the right thing — assertions encode the reference behaviour |
| bad | Agent did the wrong thing — assertions guard against the regression |
| edge | Corner case worth covering, neither inherently good nor bad |
Severity
spec.severity weights the fixture's contribution to the Eval's overall score. Default weights:
| Severity | Weight |
|---|---|
| low | 0.5 |
| medium | 1.0 |
| high | 2.0 |
| critical | 4.0 |
A single critical-severity failure can dominate the overall score. The weights are configurable per Eval via spec.scoring.severityWeights.
Assertion types
| Type | Purpose | Required fields |
|---|---|---|
| toolCalled | The named tool was invoked. Optional before / after for ordering checks | tool, optional before[], after[] |
| toolNotCalled | The named tool was not invoked | tool |
| toolSequence | The given tool order appears as a subsequence of the actual tool calls | tools[] |
| contains | Substring is present in the agent's final message | value, optional field |
| notContains | Substring is absent from the final message | value, optional field |
| regex | Regex pattern matches somewhere in the final message | pattern |
| cost | Total tokens stayed below a cap | maxTokens |
| judge | LLM-as-judge scoring against a rubric (see below) | rubric, optional minScore, kind |
| latency | Reserved (timestamp arithmetic deferred) | — |
Each assertion can also set:
- weight (default 1.0) — multiplier for this assertion's contribution to the fixture's combined score
- severity — override the fixture-level severity for this assertion only
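For example, a hypothetical fixture snippet (field names as documented above) that doubles one assertion's weight and escalates another to critical:

```yaml
assertions:
  - { type: contains, value: "answer:", weight: 2.0 }        # counts double in the fixture score
  - { type: toolNotCalled, tool: bash, severity: critical }  # overrides the fixture-level severity
```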
Tool replay and shadow
spec.toolReplay and spec.shadowTools are reserved fields on the schema for future use. They will let you record tool outputs and skip live execution for side-effecting tools during fixture runs. The runner accepts them today but ignores them — until they're honored by the harness, only run eval fixtures against agents whose tools are safe to live-execute (read-only or sandboxed).
Origin
When a fixture is promoted from a trace via the UI, spec.origin records the source trace ID, the promoter's identity, and the timestamp. Hand-written fixtures leave it blank.
Eval
The runner config — points at an agent, picks fixtures, configures the judge, sets pass threshold, optionally schedules the suite.
apiVersion: pai.io/v1
kind: Eval
metadata:
name: support-bot-eval
spec:
targetAgent: support-bot # Required — the Agent under test
fixtureSelector:
matchLabels: { agent: support-bot } # Auto-include matching fixtures
fixtures: # Optional explicit list
- refund-must-clarify
judge:
model: anthropic/claude-sonnet-4-6
temperature: 0
calibration: support-bot-judge # Optional — see JudgeCalibration
scoring:
passThreshold: 0.85
severityWeights:
critical: 8.0 # bump critical so one failure forces overall Fail
execution:
parallel: 5
timeoutSeconds: 300
repsPerFixture: 1 # Bump to >1 for variance control on noisy judges
schedule: "0 9 * * *" # Optional — cron, runs nightly at 09:00 UTC
Fixture selection
Two ways to pick fixtures, used in combination — the Eval runs the union:
- Label selector (recommended) — spec.fixtureSelector.matchLabels. Auto-includes every current and future fixture with matching labels. The promote-from-trace wizard's default labels make this Just Work.
- Explicit list — spec.fixtures: [name, ...]. Pinned set; future promotions don't auto-join. Useful for scoping a small experiment.
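A minimal sketch of the union semantics (illustrative only — the controller's actual selection code may differ):

```python
def selected_fixtures(eval_spec, all_fixtures):
    """Union of label-matched fixtures and the explicit list, deduplicated by name.

    eval_spec: dict shaped like the Eval spec above.
    all_fixtures: list of EvalFixture manifests (dicts with metadata.name/labels).
    """
    match = eval_spec.get("fixtureSelector", {}).get("matchLabels", {})
    explicit = set(eval_spec.get("fixtures", []))
    names = set()
    for fx in all_fixtures:
        labels = fx.get("metadata", {}).get("labels", {})
        # A fixture matches when every selector label is present with the same value
        if match and all(labels.get(k) == v for k, v in match.items()):
            names.add(fx["metadata"]["name"])
    return sorted(names | explicit)
```

Because the result is a union, pinning a fixture in spec.fixtures never removes label-matched ones — it only adds.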
Scoring
Per-fixture: each assertion contributes its weight × severity multiplier to a weighted sum. A fixture passes when every assertion passes. The fixture's score is the weighted-pass ratio.
Eval-level: the overall score is the severity-weighted pass rate across fixtures. The Eval's lastResult is Pass when overall ≥ passThreshold, otherwise Fail.
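The Eval-level arithmetic can be sketched as follows, using the default severity weights from the table above (an illustration of the computation, not the runner's exact code):

```python
SEVERITY_WEIGHTS = {"low": 0.5, "medium": 1.0, "high": 2.0, "critical": 4.0}

def eval_score(fixture_results, weights=SEVERITY_WEIGHTS):
    """Severity-weighted pass rate across fixtures.

    fixture_results: list of (severity, passed) pairs, one per fixture.
    """
    total = sum(weights[sev] for sev, _ in fixture_results)
    passed = sum(weights[sev] for sev, ok in fixture_results if ok)
    return passed / total if total else 0.0

# One critical failure among four passing medium fixtures:
# passed = 4 * 1.0 = 4.0, total = 4.0 + 4.0 = 8.0 → score 0.5,
# well below a 0.85 passThreshold — the critical fixture dominates.
```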
Schedule
Setting spec.schedule to a 5-field cron expression makes the controller manage a periodic run. The Eval's lastRunAt, lastResult, and fixtureResults always reflect the most recent run.
Status fields
| Field | Description |
|---|---|
| phase | Pending / Running / Passed / Failed / Stale (target Agent generation drifted) |
| lastResult | Pass / Fail |
| lastScore | Overall weighted score, formatted to 2dp |
| passingFixtures / failingFixtures / totalFixtures | Per-run counts |
| fixtureResults[] | Per-fixture verdicts: name, score, result, severity, message, runId, agent output, judge verdict (when applicable) |
| judgeAgreement | Cohen's κ from the latest JudgeCalibration replay (slice 2.x) |
| targetAgentGeneration | The target Agent's metadata.generation at the last run |
Judge
Judge assertions are scored by an LLM. The Eval's spec.judge.model chooses the judge; each judge assertion supplies its own rubric. The runner builds a strict-JSON prompt with the rubric, the user's input, and a numbered transcript of the agent's run, then expects a verdict shaped like:
{
"score": 0.4,
"confidence": 0.92,
"summary": "Agent issued a refund without asking why.",
"violations": [
{
"rule": "ask_reason_before_refund",
"severity": "high",
"evidence_step": 4,
"quote": "I'll go ahead and create a refund ticket"
}
],
"what_would_raise_score": "Insert a clarify_reason tool call before any refund-related tool."
}
The runner parses the JSON, stores the verdict on the fixture's row in status.fixtureResults[].judge, and surfaces it in the UI under each fixture's "Show details" twisty — including evidence step pointers that link back to the corresponding event in the trace timeline.
Why ask for evidence_step
Forcing the judge to cite specific trace steps for every violation makes the verdict auditable. The UI uses these to highlight the cited step; an unsupported violation (no evidence_step, or pointing at a non-existent step) is the easiest tell that the judge is hallucinating.
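The check this enables can be sketched as below — flagging any violation whose evidence_step is missing or points outside the transcript (a hypothetical helper, not part of the shipped runner):

```python
def unsupported_violations(verdict, transcript_len):
    """Return violations that cite no step, or a step outside the numbered
    transcript — the 'judge is hallucinating' tell described above."""
    bad = []
    for v in verdict.get("violations", []):
        step = v.get("evidence_step")
        # Steps are assumed 1-indexed to match the numbered transcript
        if step is None or not (1 <= step <= transcript_len):
            bad.append(v)
    return bad
```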
Choosing a judge model
Any model exposed by your namespace's ModelProviders works. Common patterns:
- claude-haiku-4-5 for routine rubrics — fast, cheap, sufficient for most behavioural checks
- claude-sonnet-4-6 for safety-critical rubrics (refund decisions, legal compliance)
- gemini-2.5-flash as a cross-vendor sanity check on top of an Anthropic primary judge
The web UI's New eval form populates a dropdown from your configured ModelProviders.
JudgeCalibration
A JudgeCalibration is a small set of human-labeled examples that validate a judge configuration before it gates anything. The runner replays the judge against each example, compares its score to the human label, and computes Cohen's κ (binarized at 0.5). The Eval's status.judgeAgreement surfaces the result, so an Eval can refuse to gate when its judge agreement is below threshold.
apiVersion: pai.io/v1
kind: JudgeCalibration
metadata:
name: support-bot-judge
spec:
rubric: |
The agent should ask the customer for the reason before issuing a refund.
judgeModels:
- anthropic/claude-haiku-4-5
- anthropic/claude-sonnet-4-6
examples:
- traceRef: conv-support-cd7e649e
input:
ticket: "I want a refund for order 12345"
candidateOutput: "I'll create a refund ticket right away."
humanScore: 0.0
humanLabel: Did not clarify, issued refund directly
disagreementReason: rubricWrong # optional — set when imported from a Challenge
- traceRef: conv-support-fa1c2b3
input:
ticket: "I want a refund for order 99812"
candidateOutput: "Could you tell me why you'd like to return it?"
humanScore: 1.0
humanLabel: Asked the right clarifying question
Replay the calibration:
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.pairun.dev/judge-calibrations/support-bot-judge/replay
Or click Replay in the UI under Settings → Judge calibrations. The replay populates:
| Field | Meaning |
|---|---|
| status.agreement | Best-performing model's κ, formatted to 2dp |
| status.agreementByModel{} | Per-model κ |
| status.judgesEvaluated[] | Models actually replayed |
| status.phase | Calibrated (κ ≥ 0.6), Stale (κ < 0.6 but at least one example scored), Failed (no examples scored) |
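Cohen's κ for the binarized scores can be computed as below — a self-contained sketch of the standard two-rater formula, binarizing at 0.5 as described (not the runner's code):

```python
def cohens_kappa(human_scores, judge_scores, threshold=0.5):
    """Binary Cohen's kappa: binarize both score lists at `threshold`,
    then kappa = (p_o - p_e) / (1 - p_e)."""
    h = [s >= threshold for s in human_scores]
    j = [s >= threshold for s in judge_scores]
    n = len(h)
    p_o = sum(a == b for a, b in zip(h, j)) / n   # observed agreement
    p_h = sum(h) / n                               # P(human label = pass)
    p_j = sum(j) / n                               # P(judge label = pass)
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)       # agreement expected by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Note κ corrects for chance agreement: a judge that rates everything "pass" against a mostly-passing label set scores near 0, not near 1, which is why it's a better gate than raw accuracy.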
Production-data flywheel
The point of all this is to grow your fixture set from real traffic without writing tests by hand.
- Capture — every task agent run becomes a Trace. The Trace tab on the agent shows the full timeline of LLM calls, tool calls, and final output.
- Promote — when a trace is interesting (a regression, a new edge case, a positive example), click Promote to fixture. The wizard pre-fills the input from the trace, suggests toolCalled assertions for every tool the agent used, and lets you add a judge rubric inline.
- Auto-include — promoted fixtures are labeled agent=<targetAgent> by default, which the Eval's fixtureSelector already matches. The next Run now picks them up.
- Gate — when the agent owner edits the template (system prompt, tools, model), the Eval runs against the candidate spec before the patch is applied. Failures block the edit; an explicit "Save anyway" override is available and audit-logged.
Eval gate (on edit)
When you edit a template Agent in the web UI, Save now flows through POST /agent-defs/{name}/eval-and-update:
- The apiserver creates a shadow Agent {name}-cand-{run} with the candidate spec
- For each Eval whose targetAgent == name, it creates a candidate Eval pointed at the shadow (schedule cleared so cron evals don't fire mid-gate)
- The runner executes the candidate Evals against the shadow
- All-Pass → the patch is applied to the original Agent. Any-Fail → the original is left untouched and the failing fixtures are surfaced in the UI
- Always: shadow + candidate Evals are torn down
If you need to ship a known-bad change in an emergency, the failure dialog has a Save anyway button that re-issues the request with ?force=true. The bypass is logged to stdout with the actor name; the audit chain entry is structured so it can be alerted on.
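The gate's decision logic reduces to "apply only when every candidate Eval passes, unless forced" — a hypothetical sketch (function and names invented for illustration):

```python
def gate_decision(candidate_results, force=False):
    """candidate_results: {eval_name: "Pass" | "Fail"} from the candidate runs.

    Returns (apply_patch, failing_evals)."""
    failing = sorted(name for name, r in candidate_results.items() if r != "Pass")
    if failing and not force:
        return False, failing   # edit blocked; failures surfaced in the UI
    return True, failing        # applied; if forced, the bypass is audit-logged
```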
Limits
| Limit | Value | Where it bites |
|---|---|---|
| Fixtures per Eval | unbounded | A 1000-fixture eval will be slow — use parallelism + repsPerFixture wisely |
| Judge timeout | 60s per call | Configurable in source; rare to hit |
| Output captured per fixture | 8 KiB | Truncated for status field; full event log lives in the trace |
| Judge reasoning stored | 4 KiB | Truncated in status; full prose in the gateway audit chain |
| Judge violation list | 10 entries | Extra violations dropped from the stored verdict |
| repsPerFixture per Eval | unbounded | Each rep is a separate Session — multiplies cost linearly |
Status, troubleshooting
| Symptom | Likely cause |
|---|---|
| Eval phase: Stale | The target Agent's metadata.generation advanced since the last run. Run again to refresh. |
| Fixture row Skip | The runner couldn't reach a terminal status event before timeout. Bump spec.execution.timeoutSeconds. |
| Judge assertions all "skipped" | spec.judge.model is unset on the Eval, or your judge's ModelProvider is missing. Set it in the Eval YAML or the New eval form's dropdown. |
| Fixture's output is empty | Session timed out before the agent emitted a final agent.message. Check the trace timeline for an error or tool-loop. |
| New trace promotion didn't appear in the Eval | The promoted fixture's labels don't match the Eval's fixtureSelector. The wizard's default is agent=<targetAgent> — verify on the Fixtures section of the per-agent Evals tab. |
See also
- Agent — template Agents that Evals target
- Model Provider — where the judge model comes from
- Memory — persistent state for stateful agents