Eval

An Eval is a quality test for an Agent. It runs a curated set of test cases (EvalFixtures) against an Agent template, scores the results — both deterministically and via an optional LLM judge — and gates risky changes behind those scores.

The data model has three pieces:

| CRD | Short | What it represents |
| --- | --- | --- |
| Eval | ev | The runner config: which Agent, which fixtures, which judge, pass threshold, optional cron |
| EvalFixture | efx | One specific test case (one prompt + assertions) |
| JudgeCalibration | jcal | Human-labeled examples that validate a judge model's agreement before it gates anything |

Together they let you turn real production traffic into a permanent regression suite, then refuse to ship agent changes that would regress it.

Why this is here

Agents are non-deterministic. A change to the system prompt that fixes one case can silently regress five others, and "manual testing" doesn't catch it. Pai's eval surface gives you the same discipline you have for normal code: an auto-running suite that grows organically from production traces, scores results with a calibrated judge, and blocks edits that drop the score.

The flywheel:

production traffic → trace → promote → EvalFixture
        ↓
EvalFixture is auto-included in the per-agent Eval (label match)
        ↓
Eval runs on edit / on cron / on demand
        ↓
On regression: edit blocked, fixtures listed inline so you see what broke

Quickstart

Apply a chat-only template, a fixture, and an Eval. The fixture's default labels match the Eval's selector, so the fixture is auto-included.

---
apiVersion: pai.io/v1
kind: Agent
metadata:
  name: math-tutor
spec:
  models: [anthropic/claude-haiku-4-5]
  system: |
    You are a math tutor. Answer in one short sentence.
    Always include the literal text "answer:" before the result.
  tools:
    - { type: bash, enabled: false }
    - { type: send_email, enabled: false }
---
apiVersion: pai.io/v1
kind: EvalFixture
metadata:
  name: tutor-add
  labels:
    agent: math-tutor
spec:
  targetAgent: math-tutor
  kind: golden
  severity: medium
  description: Simple addition; tutor must prefix the result with "answer:".
  input:
    type: prompt
    payload: { prompt: "What is 2 + 2?" }
  assertions:
    - { type: contains, value: "answer:" }
    - { type: contains, value: "4" }
    - { type: toolNotCalled, tool: "bash" }
    - type: judge
      kind: rubric
      rubric: |
        Did the agent answer correctly AND prefix the result with the
        literal text "answer:" (lowercase)? Penalize verbosity.
      minScore: 0.8
      weight: 2.0
---
apiVersion: pai.io/v1
kind: Eval
metadata:
  name: math-tutor-eval
spec:
  targetAgent: math-tutor
  fixtureSelector:
    matchLabels: { agent: math-tutor }
  judge:
    model: anthropic/claude-haiku-4-5
    temperature: 0
  scoring:
    passThreshold: 1.0
  execution:
    parallel: 3
    timeoutSeconds: 120

Apply and run:

pai apply -f tutor.yaml
# Trigger the run from the web UI's "Run now" button, or via the API:
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.pairun.dev/evals/math-tutor-eval/run
# Watch progress
pai get eval math-tutor-eval

A finished run writes its Pass or Fail verdict to the Eval's status along with per-fixture scores; judge verdicts (when applicable) show up under each fixture in the UI with the rubric reasoning, violations, and a "what would raise the score" suggestion.
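
As a rough sketch, a passing quickstart run leaves the Eval with a status shaped roughly like this (field names are the documented status fields; the values are illustrative and the exact CLI rendering may differ):

status:
  phase: Passed
  lastResult: Pass
  lastScore: "1.00"          # overall weighted score, formatted to 2dp
  passingFixtures: 1
  failingFixtures: 0
  totalFixtures: 1
  fixtureResults:
    - name: tutor-add
      result: Pass
      score: 1.0
      severity: medium
      message: all assertions passed    # illustrative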

CLI

pai apply -f eval.yaml             # Create or update via YAML (multi-doc OK)
pai get evals                      # List Evals in your namespace
pai get eval <name>                # Show one Eval's status
pai get eval-fixtures              # List fixtures (kubectl alias: efx)
pai delete eval <name>             # Delete an Eval (fixtures are not removed)
pai delete eval-fixture <name>     # Delete a fixture

The web UI exposes the same operations plus a few flows that don't have CLI equivalents yet:

  • Promote a trace to a fixture (Agent detail → Trace tab → Promote to fixture)
  • New eval form that picks fixtures by label or explicit list
  • Eval gate on Save when editing a template

EvalFixture

A fixture captures one test case: one input, one set of assertions, one expected outcome.

apiVersion: pai.io/v1
kind: EvalFixture
metadata:
  name: refund-must-clarify
  labels:
    agent: support-bot
    scenario: refund
spec:
  targetAgent: support-bot
  kind: bad                  # golden | bad | edge
  severity: high             # low | medium | high | critical
  description: Original Jane issued a 100% refund without asking why.
  origin:                    # populated by Promote-from-trace
    traceId: conv-support-cd7e649e
    promotedAt: "2026-05-04T14:22:01Z"
    promotedBy: amir@company.com
  input:
    type: webhook            # webhook | prompt | telegram | slack | email
    payload:
      ticket: "I want a refund for order #12345"
  assertions:
    - { type: toolCalled, tool: clarify_reason, before: [create_jira_ticket] }
    - { type: toolNotCalled, tool: send_email, after: [clarify_reason] }
    - type: judge
      kind: rubric
      rubric: |
        Did the agent ask the customer for the reason before creating any
        ticket or drafting a refund email? Score 1 if yes, 0 if not.
      minScore: 0.8

Fixture kind

spec.kind is a label that documents the fixture's intent. The runner treats all three kinds identically; they are purely informational, but the UI and the promotion wizard make use of them.

| Kind | When to use |
| --- | --- |
| golden | Agent did the right thing — assertions encode the reference behaviour |
| bad | Agent did the wrong thing — assertions guard against the regression |
| edge | Corner case worth covering, neither inherently good nor bad |

Severity

spec.severity weights the fixture's contribution to the Eval's overall score. Default weights:

| Severity | Weight |
| --- | --- |
| low | 0.5 |
| medium | 1.0 |
| high | 2.0 |
| critical | 4.0 |

A single critical-severity failure can dominate the overall score. The weights are configurable per Eval via spec.scoring.severityWeights.

Assertion types

| Type | Purpose | Required fields |
| --- | --- | --- |
| toolCalled | The named tool was invoked. Optional before / after for ordering checks | tool, optional before[], after[] |
| toolNotCalled | The named tool was not invoked | tool |
| toolSequence | The given tool order appears as a subsequence of the actual tool calls | tools[] |
| contains | Substring is present in the agent's final message | value, optional field |
| notContains | Substring is absent from the final message | value, optional field |
| regex | Regex pattern matches somewhere in the final message | pattern |
| cost | Total tokens stayed below a cap | maxTokens |
| judge | LLM-as-judge scoring against a rubric (see below) | rubric, optional minScore, kind |
| latency | Reserved (timestamp arithmetic deferred) | |

Each assertion can also set:

  • weight (default 1.0) — multiplier for this assertion's contribution to the fixture's combined score
  • severity — override the fixture-level severity for this assertion only
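
For instance, a minimal sketch showing both knobs on ordinary assertions, using the field names documented above (the specific checks are illustrative):

assertions:
  - type: contains
    value: "answer:"
    weight: 0.5            # counts half as much as a default-weight assertion
  - type: toolNotCalled
    tool: send_email
    severity: critical     # overrides the fixture-level severity for this assertion only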

Tool replay and shadow

spec.toolReplay and spec.shadowTools are reserved fields on the schema for future use. They will let you record tool outputs and skip live execution for side-effecting tools during fixture runs. The runner accepts them today but ignores them — until they're honored by the harness, only run eval fixtures against agents whose tools are safe to live-execute (read-only or sandboxed).

Origin

When a fixture is promoted from a trace via the UI, spec.origin records the source trace ID, the promoter's identity, and the timestamp. Hand-written fixtures leave it blank.

Eval

The runner config — points at an agent, picks fixtures, configures the judge, sets pass threshold, optionally schedules the suite.

apiVersion: pai.io/v1
kind: Eval
metadata:
  name: support-bot-eval
spec:
  targetAgent: support-bot                # Required — the Agent under test
  fixtureSelector:
    matchLabels: { agent: support-bot }   # Auto-include matching fixtures
  fixtures:                               # Optional explicit list
    - refund-must-clarify
  judge:
    model: anthropic/claude-sonnet-4-6
    temperature: 0
    calibration: support-bot-judge        # Optional — see JudgeCalibration
  scoring:
    passThreshold: 0.85
    severityWeights:
      critical: 8.0                       # bump critical so one failure forces overall Fail
  execution:
    parallel: 5
    timeoutSeconds: 300
    repsPerFixture: 1                     # Bump to >1 for variance control on noisy judges
  schedule: "0 9 * * *"                   # Optional — cron, runs nightly at 09:00 UTC

Fixture selection

Two ways to pick fixtures, used in combination — the Eval runs the union:

  1. Label selector (recommended) — spec.fixtureSelector.matchLabels. Auto-includes every current and future fixture with matching labels. The promote-from-trace wizard's default labels make this Just Work.
  2. Explicit list — spec.fixtures: [name, ...]. A pinned set; future promotions don't auto-join. Useful for scoping a small experiment (see the sketch below).
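
A minimal sketch of such a pinned-list Eval (the Eval name and the second fixture are hypothetical):

apiVersion: pai.io/v1
kind: Eval
metadata:
  name: refund-experiment
spec:
  targetAgent: support-bot
  fixtures:                    # pinned set only; no fixtureSelector, so future promotions don't auto-join
    - refund-must-clarify
    - refund-partial-amount    # hypothetical fixture
  judge:
    model: anthropic/claude-haiku-4-5
  scoring:
    passThreshold: 0.85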

Scoring

Per-fixture: each assertion contributes its weight × severity multiplier to a weighted sum. A fixture passes when every assertion passes. The fixture's score is the weighted-pass ratio.

Eval-level: the overall score is the severity-weighted pass rate across fixtures. The Eval's lastResult is Pass when overall ≥ passThreshold, otherwise Fail.
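
For example, with the default severity weights, a run where three medium fixtures (1.0 each) pass and one critical fixture (4.0) fails scores 3.0 / 7.0 ≈ 0.43, well below a 0.85 passThreshold, so lastResult is Fail.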

Schedule

Setting spec.schedule to a 5-field cron expression makes the controller manage a periodic run. The Eval's lastRunAt, lastResult, and fixtureResults always reflect the most recent run.

Status fields

| Field | Description |
| --- | --- |
| phase | Pending / Running / Passed / Failed / Stale (target Agent generation drifted) |
| lastResult | Pass / Fail |
| lastScore | Overall weighted score, formatted to 2dp |
| passingFixtures / failingFixtures / totalFixtures | Per-run counts |
| fixtureResults[] | Per-fixture verdicts: name, score, result, severity, message, runId, agent output, judge verdict (when applicable) |
| judgeAgreement | Cohen's κ from the latest JudgeCalibration replay (slice 2.x) |
| targetAgentGeneration | The target Agent's metadata.generation at the last run |

Judge

Judge assertions are scored by an LLM. The Eval's spec.judge.model chooses the judge; each judge assertion supplies its own rubric. The runner builds a strict-JSON prompt with the rubric, the user's input, and a numbered transcript of the agent's run, then expects a verdict shaped like:

{
  "score": 0.4,
  "confidence": 0.92,
  "summary": "Agent issued a refund without asking why.",
  "violations": [
    {
      "rule": "ask_reason_before_refund",
      "severity": "high",
      "evidence_step": 4,
      "quote": "I'll go ahead and create a refund ticket"
    }
  ],
  "what_would_raise_score": "Insert a clarify_reason tool call before any refund-related tool."
}

The runner parses the JSON, stores the verdict on the fixture's row in status.fixtureResults[].judge, and surfaces it in the UI under each fixture's "Show details" twisty — including evidence step pointers that link back to the corresponding event in the trace timeline.

Why ask for evidence_step

Forcing the judge to cite specific trace steps for every violation makes the verdict auditable. The UI uses these to highlight the cited step; an unsupported violation (no evidence_step, or pointing at a non-existent step) is the easiest tell that the judge is hallucinating.

Choosing a judge model

Any model exposed by your namespace's ModelProviders works. Common patterns:

  • claude-haiku-4-5 for routine rubrics — fast, cheap, sufficient for most behavioural checks
  • claude-sonnet-4-6 for safety-critical rubrics (refund decisions, legal compliance)
  • gemini-2.5-flash as a cross-vendor sanity check on top of an Anthropic primary judge

The web UI's New eval form populates a dropdown from your configured ModelProviders.
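
One way to get the cross-vendor check, given that an Eval has a single judge model, is a second Eval over the same fixture selector with a different judge. A sketch, where the Eval name and the provider-prefixed model identifier are assumptions (use whatever your ModelProviders actually expose):

apiVersion: pai.io/v1
kind: Eval
metadata:
  name: support-bot-eval-crosscheck    # hypothetical
spec:
  targetAgent: support-bot
  fixtureSelector:
    matchLabels: { agent: support-bot }
  judge:
    model: google/gemini-2.5-flash     # assumed identifier; check your ModelProvider config
    temperature: 0
  scoring:
    passThreshold: 0.85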

JudgeCalibration

A JudgeCalibration is a small set of human-labeled examples that validate a judge configuration before it gates anything. The runner replays the judge against each example, compares its score to the human label, and computes Cohen's κ (binarized at 0.5). The Eval's status.judgeAgreement surfaces the result, so an Eval can refuse to gate when its judge agreement is below threshold.
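
For example, if the judge and the human agree on 9 of 10 binarized examples (observed agreement 0.9) and the labels are split evenly enough that chance agreement is 0.5, then κ = (0.9 − 0.5) / (1 − 0.5) = 0.8, comfortably above the 0.6 Calibrated threshold.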

apiVersion: pai.io/v1
kind: JudgeCalibration
metadata:
  name: support-bot-judge
spec:
  rubric: |
    The agent should ask the customer for the reason before issuing a refund.
  judgeModels:
    - anthropic/claude-haiku-4-5
    - anthropic/claude-sonnet-4-6
  examples:
    - traceRef: conv-support-cd7e649e
      input:
        ticket: "I want a refund for order 12345"
      candidateOutput: "I'll create a refund ticket right away."
      humanScore: 0.0
      humanLabel: Did not clarify, issued refund directly
      disagreementReason: rubricWrong    # optional — set when imported from a Challenge
    - traceRef: conv-support-fa1c2b3
      input:
        ticket: "I want a refund for order 99812"
      candidateOutput: "Could you tell me why you'd like to return it?"
      humanScore: 1.0
      humanLabel: Asked the right clarifying question

Replay the calibration:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.pairun.dev/judge-calibrations/support-bot-judge/replay

Or click Replay in the UI under Settings → Judge calibrations. The replay populates:

| Field | Meaning |
| --- | --- |
| status.agreement | Best-performing model's κ, formatted to 2dp |
| status.agreementByModel{} | Per-model κ |
| status.judgesEvaluated[] | Models actually replayed |
| status.phase | Calibrated (κ ≥ 0.6), Stale (κ < 0.6 but at least one example scored), Failed (no examples scored) |

Production-data flywheel

The point of all this is to grow your fixture set from real traffic without writing tests by hand.

  1. Capture — every task agent run becomes a Trace. The Trace tab on the agent shows the full timeline of LLM calls, tool calls, and final output.
  2. Promote — when a trace is interesting (a regression, a new edge case, a positive example), click Promote to fixture. The wizard pre-fills the input from the trace, suggests toolCalled assertions for every tool the agent used, and lets you add a judge rubric inline.
  3. Auto-include — promoted fixtures are labeled agent=<targetAgent> by default, which the Eval's fixtureSelector already matches. The next Run now picks them up.
  4. Gate — when the agent owner edits the template (system prompt, tools, model), the Eval runs against the candidate spec before the patch is applied. Failures block the edit; an explicit "Save anyway" override is available and audit-logged.

Eval gate (on edit)

When you edit a template Agent in the web UI, Save now flows through POST /agent-defs/{name}/eval-and-update:

  1. The apiserver creates a shadow Agent {name}-cand-{run} with the candidate spec
  2. For each Eval whose targetAgent == name, it creates a candidate Eval pointed at the shadow (schedule cleared so cron evals don't fire mid-gate)
  3. The runner executes the candidate Evals against the shadow
  4. All-Pass → the patch is applied to the original Agent. Any-Fail → the original is left untouched and the failing fixtures are surfaced in the UI
  5. Always: shadow + candidate Evals are torn down

If you need to ship a known-bad change in an emergency, the failure dialog has a Save anyway button that re-issues the request with ?force=true. The bypass is logged to stdout with the actor name; the audit chain entry is structured so it can be alerted on.

Limits

| Limit | Value | Where it bites |
| --- | --- | --- |
| Fixtures per Eval | unbounded | A 1000-fixture eval will be slow — use parallelism + repsPerFixture wisely |
| Judge timeout | 60s per call | Configurable in source; rare to hit |
| Output captured per fixture | 8 KiB | Truncated for status field; full event log lives in the trace |
| Judge reasoning stored | 4 KiB | Truncated in status; full prose in the gateway audit chain |
| Judge violation list | 10 entries | Extra violations dropped from the stored verdict |
| repsPerFixture per Eval | unbounded | Each rep is a separate Session — multiplies cost linearly |

Status, troubleshooting

| Symptom | Likely cause |
| --- | --- |
| Eval phase: Stale | The target Agent's metadata.generation advanced since the last run. Run again to refresh. |
| Fixture row Skip | The runner couldn't reach a terminal status event before timeout. Bump spec.execution.timeoutSeconds. |
| Judge assertions all "skipped" | spec.judge.model is unset on the Eval, or your judge's ModelProvider is missing. Set it in the Eval YAML or the New eval form's dropdown. |
| Fixture's output is empty | Session timed out before the agent emitted a final agent.message. Check the trace timeline for an error or tool-loop. |
| New trace promotion didn't appear in the Eval | The promoted fixture's labels don't match the Eval's fixtureSelector. The wizard's default is agent=<targetAgent> — verify on the Fixtures section of the per-agent Evals tab. |

See also

  • Agent — template Agents that Evals target
  • Model Provider — where the judge model comes from
  • Memory — persistent state for stateful agents