Eval

An Eval is a quality test for an Agent. It runs a curated set of test cases (EvalFixtures) against an Agent template, scores the results — both deterministically and via an optional LLM judge — and gates risky changes behind those scores.

The data model has three pieces:

| CRD | Short | What it represents |
| --- | --- | --- |
| Eval | ev | The runner config: which Agent, which fixtures, which judge, pass threshold, optional cron |
| EvalFixture | efx | One specific test case (one prompt + assertions) |
| JudgeCalibration | jcal | Human-labeled examples that validate a judge model's agreement before it gates anything |

Together they let you turn real production traffic into a permanent regression suite, then refuse to ship agent changes that would regress it.

Why this is here

Agents are non-deterministic. A change to the system prompt that fixes one case can silently regress five others, and "manual testing" doesn't catch it. Pai's eval surface gives you the same discipline you have for normal code: an auto-running suite that grows organically from production traces, scores results with a calibrated judge, and blocks edits that drop the score.

The flywheel:

production traffic → trace → promote → EvalFixture
        ↓
EvalFixture is auto-included in the per-agent Eval (label match)
        ↓
Eval runs on edit / on cron / on demand
        ↓
On regression: edit blocked, fixtures listed inline so you see what broke

Quickstart

Apply a chat-only template, a fixture, and an Eval. The fixture's default labels match the Eval's selector, so the fixture is auto-included.

---
apiVersion: pai.io/v1
kind: Agent
metadata:
  name: math-tutor
spec:
  models: [anthropic/claude-haiku-4-5]
  system: |
    You are a math tutor. Answer in one short sentence.
    Always include the literal text "answer:" before the result.
  tools:
    - { type: bash, enabled: false }
    - { type: send_email, enabled: false }
---
apiVersion: pai.io/v1
kind: EvalFixture
metadata:
  name: tutor-add
  labels:
    agent: math-tutor
spec:
  targetAgent: math-tutor
  kind: golden
  severity: medium
  description: Simple addition; tutor must prefix the result with "answer:".
  input:
    type: prompt
    payload: { prompt: "What is 2 + 2?" }
  assertions:
    - { type: contains, value: "answer:" }
    - { type: contains, value: "4" }
    - { type: toolNotCalled, tool: "bash" }
    - type: judge
      kind: rubric
      rubric: |
        Did the agent answer correctly AND prefix the result with the
        literal text "answer:" (lowercase)? Penalize verbosity.
      minScore: 0.8
      weight: 2.0
---
apiVersion: pai.io/v1
kind: Eval
metadata:
  name: math-tutor-eval
spec:
  targetAgent: math-tutor
  fixtureSelector:
    matchLabels: { agent: math-tutor }
  judge:
    model: anthropic/claude-haiku-4-5
    temperature: 0
  scoring:
    passThreshold: 1.0
  execution:
    parallel: 3
    timeoutSeconds: 120

Apply and run:

pai apply -f tutor.yaml
# Trigger the run from the web UI's "Run now" button, or via the API:
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.pairun.dev/evals/math-tutor-eval/run
# Watch progress
pai get eval math-tutor-eval

A finished run writes its Pass or Fail verdict to the Eval's status along with per-fixture scores; judge verdicts (when applicable) show up under each fixture in the UI with the rubric reasoning, violations, and a "what would raise the score" suggestion.
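
As a rough sketch, a passing quickstart run leaves the Eval with a status shaped roughly like this (field names are the documented status fields; the values are illustrative and the exact CLI rendering may differ):

status:
  phase: Passed
  lastResult: Pass
  lastScore: "1.00"          # overall weighted score, formatted to 2dp
  passingFixtures: 1
  failingFixtures: 0
  totalFixtures: 1
  fixtureResults:
    - name: tutor-add
      result: Pass
      score: 1.0
      severity: medium
      message: all assertions passed    # illustrative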

CLI

pai apply -f eval.yaml             # Create or update via YAML (multi-doc OK)
pai get evals                      # List Evals in your namespace
pai get eval <name>                # Show one Eval's status
pai get eval-fixtures              # List fixtures (kubectl alias: efx)
pai delete eval <name>             # Delete an Eval (fixtures are not removed)
pai delete eval-fixture <name>     # Delete a fixture

The web UI exposes the same operations plus a few flows that don't have CLI equivalents yet:

  • Promote a trace to a fixture (Agent detail → Trace tab → Promote to fixture)
  • New eval form that picks fixtures by label or explicit list
  • Eval gate on Save when editing a template

EvalFixture

A fixture captures one test case: one input, one set of assertions, one expected outcome.

apiVersion: pai.io/v1
kind: EvalFixture
metadata:
  name: refund-must-clarify
  labels:
    agent: support-bot
    scenario: refund
spec:
  targetAgent: support-bot
  kind: bad                  # golden | bad | edge
  severity: high             # low | medium | high | critical
  description: Original Jane issued a 100% refund without asking why.
  origin:                    # populated by Promote-from-trace
    traceId: conv-support-cd7e649e
    promotedAt: "2026-05-04T14:22:01Z"
    promotedBy: amir@company.com
  input:
    type: webhook            # webhook | prompt | telegram | slack | email
    payload:
      ticket: "I want a refund for order #12345"
  assertions:
    - { type: toolCalled, tool: clarify_reason, before: [create_jira_ticket] }
    - { type: toolNotCalled, tool: send_email, after: [clarify_reason] }
    - type: judge
      kind: rubric
      rubric: |
        Did the agent ask the customer for the reason before creating any
        ticket or drafting a refund email? Score 1 if yes, 0 if not.
      minScore: 0.8

Fixture kind

spec.kind is a label that documents the fixture's intent. The runner treats all three kinds identically; they are purely informational, but the UI and the promotion wizard make use of them.

| Kind | When to use |
| --- | --- |
| golden | Agent did the right thing — assertions encode the reference behaviour |
| bad | Agent did the wrong thing — assertions guard against the regression |
| edge | Corner case worth covering, neither inherently good nor bad |

Severity

spec.severity weights the fixture's contribution to the Eval's overall score. Default weights:

| Severity | Weight |
| --- | --- |
| low | 0.5 |
| medium | 1.0 |
| high | 2.0 |
| critical | 4.0 |

A single critical-severity failure can dominate the overall score. The weights are configurable per Eval via spec.scoring.severityWeights.

Assertion types

| Type | Purpose | Required fields |
| --- | --- | --- |
| toolCalled | The named tool was invoked. Optional before / after for ordering checks | tool, optional before[], after[] |
| toolNotCalled | The named tool was not invoked | tool |
| toolSequence | The given tool order appears as a subsequence of the actual tool calls | tools[] |
| contains | Substring is present in the agent's final message | value, optional field |
| notContains | Substring is absent from the final message | value, optional field |
| regex | Regex pattern matches somewhere in the final message | pattern |
| cost | Total tokens stayed below a cap | maxTokens |
| judge | LLM-as-judge scoring against a rubric (see below) | rubric, optional minScore, kind |
| latency | Reserved (timestamp arithmetic deferred) | |

Each assertion can also set:

  • weight (default 1.0) — multiplier for this assertion's contribution to the fixture's combined score
  • severity — override the fixture-level severity for this assertion only
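
For instance, a minimal sketch showing both knobs on ordinary assertions, using the field names documented above (the specific checks are illustrative):

assertions:
  - type: contains
    value: "answer:"
    weight: 0.5            # counts half as much as a default-weight assertion
  - type: toolNotCalled
    tool: send_email
    severity: critical     # overrides the fixture-level severity for this assertion only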

Tool replay and shadow

spec.toolReplay and spec.shadowTools are reserved fields on the schema for future use. They will let you record tool outputs and skip live execution for side-effecting tools during fixture runs. The runner accepts them today but ignores them — until they're honored by the harness, only run eval fixtures against agents whose tools are safe to live-execute (read-only or sandboxed).

Origin

When a fixture is promoted from a trace via the UI, spec.origin records the source trace ID, the promoter's identity, and the timestamp. Hand-written fixtures leave it blank.

Eval

The runner config — points at an agent, picks fixtures, configures the judge, sets pass threshold, optionally schedules the suite.

apiVersion: pai.io/v1
kind: Eval
metadata:
  name: support-bot-eval
spec:
  targetAgent: support-bot                # Required — the Agent under test
  fixtureSelector:
    matchLabels: { agent: support-bot }   # Auto-include matching fixtures
  fixtures:                               # Optional explicit list
    - refund-must-clarify
  judge:
    model: anthropic/claude-sonnet-4-6
    temperature: 0
    calibration: support-bot-judge        # Optional — see JudgeCalibration
  scoring:
    passThreshold: 0.85
    severityWeights:
      critical: 8.0                       # bump critical so one failure forces overall Fail
  execution:
    parallel: 5
    timeoutSeconds: 300
    repsPerFixture: 1                     # Bump to >1 for variance control on noisy judges
  schedule: "0 9 * * *"                   # Optional — cron, runs nightly at 09:00 UTC

Fixture selection

Two ways to pick fixtures, used in combination — the Eval runs the union:

  1. Label selector (recommended) — spec.fixtureSelector.matchLabels. Auto-includes every current and future fixture with matching labels. The promote-from-trace wizard's default labels make this Just Work.
  2. Explicit list — spec.fixtures: [name, ...]. A pinned set; future promotions don't auto-join. Useful for scoping a small experiment (see the sketch below).
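
A minimal sketch of such a pinned-list Eval (the Eval name and the second fixture are hypothetical):

apiVersion: pai.io/v1
kind: Eval
metadata:
  name: refund-experiment
spec:
  targetAgent: support-bot
  fixtures:                    # pinned set only; no fixtureSelector, so future promotions don't auto-join
    - refund-must-clarify
    - refund-partial-amount    # hypothetical fixture
  judge:
    model: anthropic/claude-haiku-4-5
  scoring:
    passThreshold: 0.85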

Scoring

Per-fixture: each assertion contributes its weight × severity multiplier to a weighted sum. A fixture passes when every assertion passes. The fixture's score is the weighted-pass ratio.

Eval-level: the overall score is the severity-weighted pass rate across fixtures. The Eval's lastResult is Pass when overall ≥ passThreshold, otherwise Fail.
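
For example, with the default severity weights, a run where three medium fixtures (1.0 each) pass and one critical fixture (4.0) fails scores 3.0 / 7.0 ≈ 0.43, well below a 0.85 passThreshold, so lastResult is Fail.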

Schedule

Setting spec.schedule to a 5-field cron expression makes the controller manage a periodic run. The Eval's lastRunAt, lastResult, and fixtureResults always reflect the most recent run.

Status fields

| Field | Description |
| --- | --- |
| phase | Pending / Running / Passed / Failed / Stale (target Agent generation drifted) |
| lastResult | Pass / Fail |
| lastScore | Overall weighted score, formatted to 2dp |
| passingFixtures / failingFixtures / totalFixtures | Per-run counts |
| fixtureResults[] | Per-fixture verdicts: name, score, result, severity, message, runId, agent output, judge verdict (when applicable) |
| judgeAgreement | Cohen's κ from the latest JudgeCalibration replay (slice 2.x) |
| targetAgentGeneration | The target Agent's metadata.generation at the last run |

Judge

Judge assertions are scored by an LLM. The Eval's spec.judge.model chooses the judge; each judge assertion supplies its own rubric. The runner builds a strict-JSON prompt with the rubric, the user's input, and a numbered transcript of the agent's run, then expects a verdict shaped like:

{
  "score": 0.4,
  "confidence": 0.92,
  "summary": "Agent issued a refund without asking why.",
  "violations": [
    {
      "rule": "ask_reason_before_refund",
      "severity": "high",
      "evidence_step": 4,
      "quote": "I'll go ahead and create a refund ticket"
    }
  ],
  "what_would_raise_score": "Insert a clarify_reason tool call before any refund-related tool."
}

The runner parses the JSON, stores the verdict on the fixture's row in status.fixtureResults[].judge, and surfaces it in the UI under each fixture's "Show details" twisty — including evidence step pointers that link back to the corresponding event in the trace timeline.

Why ask for evidence_step

Forcing the judge to cite specific trace steps for every violation makes the verdict auditable. The UI uses these to highlight the cited step; an unsupported violation (no evidence_step, or pointing at a non-existent step) is the easiest tell that the judge is hallucinating.

Choosing a judge model

Any model exposed by your namespace's ModelProviders works. Common patterns:

  • claude-haiku-4-5 for routine rubrics — fast, cheap, sufficient for most behavioural checks
  • claude-sonnet-4-6 for safety-critical rubrics (refund decisions, legal compliance)
  • gemini-2.5-flash as a cross-vendor sanity check on top of an Anthropic primary judge

The web UI's New eval form populates a dropdown from your configured ModelProviders.
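
One way to get the cross-vendor check, given that an Eval has a single judge model, is a second Eval over the same fixture selector with a different judge. A sketch, where the Eval name and the provider-prefixed model identifier are assumptions (use whatever your ModelProviders actually expose):

apiVersion: pai.io/v1
kind: Eval
metadata:
  name: support-bot-eval-crosscheck    # hypothetical
spec:
  targetAgent: support-bot
  fixtureSelector:
    matchLabels: { agent: support-bot }
  judge:
    model: google/gemini-2.5-flash     # assumed identifier; check your ModelProvider config
    temperature: 0
  scoring:
    passThreshold: 0.85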

JudgeCalibration

A JudgeCalibration is a small set of human-labeled examples that validate a judge configuration before it gates anything. The runner replays the judge against each example, compares its score to the human label, and computes Cohen's κ (binarized at 0.5). The Eval's status.judgeAgreement surfaces the result, so an Eval can refuse to gate when its judge agreement is below threshold.
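
For example, if the judge and the human agree on 9 of 10 binarized examples (observed agreement 0.9) and the labels are split evenly enough that chance agreement is 0.5, then κ = (0.9 − 0.5) / (1 − 0.5) = 0.8, comfortably above the 0.6 Calibrated threshold.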

apiVersion: pai.io/v1
kind: JudgeCalibration
metadata:
  name: support-bot-judge
spec:
  rubric: |
    The agent should ask the customer for the reason before issuing a refund.
  judgeModels:
    - anthropic/claude-haiku-4-5
    - anthropic/claude-sonnet-4-6
  examples:
    - traceRef: conv-support-cd7e649e
      input:
        ticket: "I want a refund for order 12345"
      candidateOutput: "I'll create a refund ticket right away."
      humanScore: 0.0
      humanLabel: Did not clarify, issued refund directly
      disagreementReason: rubricWrong    # optional — set when imported from a Challenge
    - traceRef: conv-support-fa1c2b3
      input:
        ticket: "I want a refund for order 99812"
      candidateOutput: "Could you tell me why you'd like to return it?"
      humanScore: 1.0
      humanLabel: Asked the right clarifying question

Replay the calibration:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.pairun.dev/judge-calibrations/support-bot-judge/replay

Or click Replay in the UI under Settings → Judge calibrations. The replay populates:

| Field | Meaning |
| --- | --- |
| status.agreement | Best-performing model's κ, formatted to 2dp |
| status.agreementByModel{} | Per-model κ |
| status.judgesEvaluated[] | Models actually replayed |
| status.phase | Calibrated (κ ≥ 0.6), Stale (κ < 0.6 but at least one example scored), Failed (no examples scored) |

Production-data flywheel

The point of all this is to grow your fixture set from real traffic without writing tests by hand.

  1. Capture — every task agent run becomes a Trace. The Trace tab on the agent shows the full timeline of LLM calls, tool calls, and final output.
  2. Promote — when a trace is interesting (a regression, a new edge case, a positive example), click Promote to fixture. The wizard pre-fills the input from the trace, suggests toolCalled assertions for every tool the agent used, and lets you add a judge rubric inline.
  3. Auto-include — promoted fixtures are labeled agent=<targetAgent> by default, which the Eval's fixtureSelector already matches. The next Run now picks them up.
  4. Gate — when the agent owner edits the template (system prompt, tools, model), the Eval runs against the candidate spec before the patch is applied. Failures block the edit; an explicit "Save anyway" override is available and audit-logged.

Eval gate (on edit)

When you edit a template Agent in the web UI, Save now flows through POST /agent-defs/{name}/eval-and-update:

  1. The apiserver creates a shadow Agent {name}-cand-{run} with the candidate spec
  2. For each Eval whose targetAgent == name, it creates a candidate Eval pointed at the shadow (schedule cleared so cron evals don't fire mid-gate)
  3. The runner executes the candidate Evals against the shadow
  4. All-Pass → the patch is applied to the original Agent. Any-Fail → the original is left untouched and the failing fixtures are surfaced in the UI
  5. Always: shadow + candidate Evals are torn down

If you need to ship a known-bad change in an emergency, the failure dialog has a Save anyway button that re-issues the request with ?force=true. The bypass is logged to stdout with the actor name; the audit chain entry is structured so it can be alerted on.

Limits

| Limit | Value | Where it bites |
| --- | --- | --- |
| Fixtures per Eval | unbounded | A 1000-fixture eval will be slow — use parallelism + repsPerFixture wisely |
| Judge timeout | 60s per call | Configurable in source; rare to hit |
| Output captured per fixture | 8 KiB | Truncated for status field; full event log lives in the trace |
| Judge reasoning stored | 4 KiB | Truncated in status; full prose in the gateway audit chain |
| Judge violation list | 10 entries | Extra violations dropped from the stored verdict |
| repsPerFixture per Eval | unbounded | Each rep is a separate Session — multiplies cost linearly |

Status, troubleshooting

| Symptom | Likely cause |
| --- | --- |
| Eval phase: Stale | The target Agent's metadata.generation advanced since the last run. Run again to refresh. |
| Fixture row Skip | The runner couldn't reach a terminal status event before timeout. Bump spec.execution.timeoutSeconds. |
| Judge assertions all "skipped" | spec.judge.model is unset on the Eval, or your judge's ModelProvider is missing. Set it in the Eval YAML or the New eval form's dropdown. |
| Fixture's output is empty | Session timed out before the agent emitted a final agent.message. Check the trace timeline for an error or tool-loop. |
| New trace promotion didn't appear in the Eval | The promoted fixture's labels don't match the Eval's fixtureSelector. The wizard's default is agent=<targetAgent> — verify on the Fixtures section of the per-agent Evals tab. |

See also

  • Agent — template Agents that Evals target
  • Model Provider — where the judge model comes from
  • Memory — persistent state for stateful agents