Evaluations — Testing Agent Quality
Run structured evaluations against datasets, score responses, and gate deployments on quality thresholds.
AgentBreeder's built-in eval framework lets you run structured tests against any deployed agent, score responses automatically (or with an LLM judge), and block deploys when quality drops below a threshold.
Evals are first-class in the CLI and the dashboard — no external eval tool required.
Core Concepts
| Concept | Meaning |
|---|---|
| Dataset | A collection of (input, expected_output) pairs stored in the registry |
| Run | One execution of a dataset against a specific agent + model config |
| Scorer | Strategy for comparing actual vs. expected output |
| Judge | An LLM used as a scorer when output cannot be compared by string match |
| Gate | A CI check that fails if a run's score falls below a threshold |
Scorers
| Scorer | Flag | Description |
|---|---|---|
| exact | --scorer exact | Fuzzy string match (Levenshtein ratio ≥ 0.85). Fast, no API calls. |
| semantic | --scorer semantic | Keyword overlap ratio. Good for long-form responses. |
| judge | --scorer judge | LLM-as-judge — scores on accuracy, helpfulness, safety, groundedness (0–1 each). |
Use judge whenever exact string matching is too brittle (open-ended questions, summaries, code generation).
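For intuition, the sketch below shows roughly what the exact and semantic strategies compute. It is an illustration only, not AgentBreeder's implementation: difflib's SequenceMatcher ratio stands in for a true Levenshtein ratio, and "keywords" here are simply whitespace-split tokens. The 0.85 threshold mirrors the table above.

# Rough sketch of the exact and semantic scoring strategies.
# Not AgentBreeder's implementation: difflib's ratio is a stand-in for a
# Levenshtein ratio, and "keywords" are just whitespace-split tokens.
from difflib import SequenceMatcher


def exact_score(actual: str, expected: str, threshold: float = 0.85) -> float:
    """Near-exact match: 1.0 if the similarity ratio clears the threshold, else 0.0."""
    ratio = SequenceMatcher(None, actual.strip(), expected.strip()).ratio()
    return 1.0 if ratio >= threshold else 0.0


def semantic_score(actual: str, expected: str) -> float:
    """Keyword overlap: fraction of expected tokens that appear in the answer."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    return len(expected_tokens & actual_tokens) / len(expected_tokens) if expected_tokens else 0.0


print(exact_score("Visit /account/reset and enter your email.",
                  "Visit /account/reset and enter your email."))      # 1.0
print(semantic_score("Go to /account/reset and enter your email address.",
                     "Visit /account/reset and enter your email."))   # ~0.67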
Quickstart
1 — Create a dataset
agentbreeder eval datasets create \
--name support-qa-v2 \
--team customer-success \
--file ./datasets/support-qa.jsonl

Dataset file format (one JSON object per line):
{"input": "How do I reset my password?", "expected": "Visit /account/reset and enter your email."}
{"input": "What's your refund policy?", "expected": "We offer 30-day refunds on all plans."}
{"input": "Can I export my data?", "expected": "Yes — use Settings → Data Export → Download CSV."}2 — Run an eval
agentbreeder eval run support-agent \
--dataset support-qa-v2 \
--scorer judge \
--judge-model claude-haiku-4-5

Example output:
╭─────────────────────────────────────────────────────────────╮
│ Eval Run — support-agent × support-qa-v2 │
│ Run ID: run-a1b2c3d4 │ Scorer: judge (claude-haiku-4-5) │
╰─────────────────────────────────────────────────────────────╯
Case 1/3 ✓ score=0.94 (accuracy=1.0 helpfulness=0.9 safety=1.0 groundedness=0.85)
Case 2/3 ✓ score=0.91 (accuracy=0.9 helpfulness=0.95 safety=1.0 groundedness=0.8)
Case 3/3 ✓ score=0.88 (accuracy=0.85 helpfulness=0.9 safety=1.0 groundedness=0.78)
┌──────────────┬───────┐
│ Metric │ Score │
├──────────────┼───────┤
│ accuracy │ 0.917 │
│ helpfulness │ 0.917 │
│ safety │ 1.000 │
│ groundedness │ 0.810 │
│ overall │ 0.911 │
└──────────────┴───────┘
✅ Run complete. Overall: 0.911

3 — View results

agentbreeder eval results run-a1b2c3d4

Run ID: run-a1b2c3d4
Agent: support-agent
Dataset: support-qa-v2
Model: claude-sonnet-4 (judge: claude-haiku-4-5)
Scorer: judge
Cases: 3 / 3 passed
Overall: 0.911
Per-case breakdown:
#1 input="How do I reset my password?"
expected="Visit /account/reset..."
actual="To reset your password, go to /account/reset and enter your email address."
score=0.94
#2 input="What's your refund policy?"
...

4 — Gate a deployment
Block a deploy if the eval score drops below a threshold:
# In CI, after running the eval:
agentbreeder eval gate run-a1b2c3d4 --threshold 0.85
# Exits 0 if score ≥ 0.85, exits 1 otherwise

Sample CI output (pass):
Gate check: run-a1b2c3d4
Threshold: 0.85
Score: 0.911
✅ Gate passed — deploying.

Sample CI output (fail):
Gate check: run-b9z8y7x6
Threshold: 0.85
Score: 0.731
❌ Gate failed — score 0.731 < threshold 0.85. Deploy blocked.

Comparing Two Runs
Catch regressions when you change a model or prompt:
agentbreeder eval compare run-a1b2c3d4 run-b9z8y7x6

Example output:
Comparing runs:
A: run-a1b2c3d4 (support-agent, claude-sonnet-4, score=0.911)
B: run-b9z8y7x6 (support-agent, claude-haiku-4-5, score=0.731)
┌──────────────┬───────┬───────┬───────────┐
│ Metric │ Run A │ Run B │ Δ │
├──────────────┼───────┼───────┼───────────┤
│ accuracy │ 0.917 │ 0.700 │ ▼ −0.217 │
│ helpfulness │ 0.917 │ 0.733 │ ▼ −0.184 │
│ safety │ 1.000 │ 1.000 │ 0.000 │
│ groundedness │ 0.810 │ 0.590 │ ▼ −0.220 │
│ overall │ 0.911 │ 0.731 │ ▼ −0.180 │
└──────────────┴───────┴───────┴───────────┘
⚠️ Regression detected (threshold: 0.05). Run B is significantly worse on:
accuracy (−0.217), groundedness (−0.220), helpfulness (−0.184)

Integrating Evals into CI
Add an eval gate to your GitHub Actions workflow:
# .github/workflows/eval-gate.yml
name: Eval Gate
on:
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install AgentBreeder
        run: pip install agentbreeder

      - name: Run eval
        run: |
          agentbreeder eval run ${{ env.AGENT_NAME }} \
            --dataset ${{ env.DATASET_ID }} \
            --scorer judge \
            --judge-model claude-haiku-4-5 \
            --json > eval-result.json
        env:
          AGENTBREEDER_API_URL: ${{ secrets.AGENTBREEDER_API_URL }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Gate on score
        run: |
          RUN_ID=$(jq -r '.run_id' eval-result.json)
          agentbreeder eval gate "$RUN_ID" --threshold 0.85

      - name: Upload eval result
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-result
          path: eval-result.json

The eval-on-pr.yml workflow in the AgentBreeder repo itself uses this pattern to gate agent changes on score ≥ 0.80.
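The same gate can be scripted outside GitHub Actions. Below is a minimal sketch, assuming the agentbreeder CLI is on PATH and that the --json output includes the run_id field the jq step above relies on; the agent and dataset names are the quickstart examples.

# CI-agnostic gate script. Assumes the `agentbreeder` CLI is installed and on
# PATH, and that `--json` output contains a `run_id` field (as the jq step
# above relies on). Agent and dataset names are the quickstart examples.
import json
import subprocess
import sys

result = subprocess.run(
    ["agentbreeder", "eval", "run", "support-agent",
     "--dataset", "support-qa-v2",
     "--scorer", "judge",
     "--judge-model", "claude-haiku-4-5",
     "--json"],
    capture_output=True, text=True, check=True,
)
run_id = json.loads(result.stdout)["run_id"]

# `eval gate` exits 0 on pass and 1 on fail, so propagate its exit code to CI.
gate = subprocess.run(["agentbreeder", "eval", "gate", run_id, "--threshold", "0.85"])
sys.exit(gate.returncode)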
CLI Reference
agentbreeder eval run AGENT_NAME --dataset DATASET_ID [OPTIONS]
Options:
  --dataset, -d TEXT           Dataset ID to run against  [required]
  --model, -m TEXT             Model override
  --temperature, -T FLOAT      Temperature override
  --scorer TEXT                exact | semantic | judge  [default: exact]
  --judge, --judge-model TEXT  LLM to use for judge scoring
  --json                       Output as JSON
agentbreeder eval datasets [--team TEAM] [--json]
agentbreeder eval results RUN_ID [--json]
agentbreeder eval compare RUN_A RUN_B [--regression-threshold 0.05] [--json]
agentbreeder eval gate RUN_ID [--threshold 0.7] [--metrics METRICS] [--json]

See agentbreeder eval --help for the full reference.
Dashboard
All eval runs are visible in the Evaluations tab of the AgentBreeder dashboard:
- Run history per agent
- Per-case diffs between runs
- Score trend charts over time
- One-click re-run with the same or a different scorer
Creating Datasets via API
curl -X POST https://your-agentbreeder/api/v1/evals/datasets \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "support-qa-v2",
"team": "customer-success",
"cases": [
{"input": "How do I reset my password?", "expected": "Visit /account/reset..."},
{"input": "What is your refund policy?", "expected": "30-day refunds on all plans."}
]
}'
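The same request can be made from a script. Here is a minimal Python sketch of the curl call above, assuming the requests package is installed and your API token is in the TOKEN environment variable; replace the host with your own AgentBreeder deployment.

# Python equivalent of the curl call above. Assumes the `requests` package and
# an API token in the TOKEN environment variable; replace the host with your
# own AgentBreeder deployment.
import os
import requests

resp = requests.post(
    "https://your-agentbreeder/api/v1/evals/datasets",
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    json={
        "name": "support-qa-v2",
        "team": "customer-success",
        "cases": [
            {"input": "How do I reset my password?", "expected": "Visit /account/reset..."},
            {"input": "What is your refund policy?", "expected": "30-day refunds on all plans."},
        ],
    },
)
resp.raise_for_status()
print(resp.json())  # created dataset record, assuming the API returns JSON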