Evaluations — Testing Agent Quality
Run structured evaluations against datasets, score responses, and gate deployments on quality thresholds.
AgentBreeder's built-in eval framework lets you run structured tests against any deployed agent, score responses automatically (or with an LLM judge), and block deploys when quality drops below a threshold.
Evals are first-class in the CLI and the dashboard — no external eval tool required.
Core Concepts
| Concept | Meaning |
|---|---|
| Dataset | A collection of (input, expected_output) pairs stored in the registry |
| Run | One execution of a dataset against a specific agent + model config |
| Scorer | Strategy for comparing actual vs. expected output |
| Judge | An LLM used as a scorer when output cannot be compared by string match |
| Gate | A CI check that fails if a run's score falls below a threshold |
Scorers
| Scorer | Flag | Description |
|---|---|---|
| exact | --scorer exact | Fuzzy string match (Levenshtein ratio ≥ 0.85). Fast, no API calls. |
| semantic | --scorer semantic | Keyword overlap ratio. Good for long-form responses. |
| judge | --scorer judge | LLM-as-judge — scores on accuracy, helpfulness, safety, groundedness (0–1 each). |
Use judge whenever exact string matching is too brittle (open-ended questions, summaries, code generation).
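For intuition, the sketch below shows roughly what the exact and semantic strategies compute. It is an illustration only, not AgentBreeder's implementation: difflib's SequenceMatcher ratio stands in for a true Levenshtein ratio, and "keywords" here are simply whitespace-split tokens. The 0.85 threshold mirrors the table above.

# Rough sketch of the exact and semantic scoring strategies.
# Not AgentBreeder's implementation: difflib's ratio is a stand-in for a
# Levenshtein ratio, and "keywords" are just whitespace-split tokens.
from difflib import SequenceMatcher


def exact_score(actual: str, expected: str, threshold: float = 0.85) -> float:
    """Near-exact match: 1.0 if the similarity ratio clears the threshold, else 0.0."""
    ratio = SequenceMatcher(None, actual.strip(), expected.strip()).ratio()
    return 1.0 if ratio >= threshold else 0.0


def semantic_score(actual: str, expected: str) -> float:
    """Keyword overlap: fraction of expected tokens that appear in the answer."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    return len(expected_tokens & actual_tokens) / len(expected_tokens) if expected_tokens else 0.0


print(exact_score("Visit /account/reset and enter your email.",
                  "Visit /account/reset and enter your email."))      # 1.0
print(semantic_score("Go to /account/reset and enter your email address.",
                     "Visit /account/reset and enter your email."))   # ~0.67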
Quickstart
1 — Create a dataset
agentbreeder eval datasets create \
--name support-qa-v2 \
--team customer-success \
--file ./datasets/support-qa.jsonl

Dataset file format (one JSON object per line):
{"input": "How do I reset my password?", "expected": "Visit /account/reset and enter your email."}
{"input": "What's your refund policy?", "expected": "We offer 30-day refunds on all plans."}
{"input": "Can I export my data?", "expected": "Yes — use Settings → Data Export → Download CSV."}2 — Run an eval
agentbreeder eval run support-agent \
--dataset support-qa-v2 \
--scorer judge \
--judge-model claude-haiku-4-5

Example output:
╭─────────────────────────────────────────────────────────────╮
│ Eval Run — support-agent × support-qa-v2 │
│ Run ID: run-a1b2c3d4 │ Scorer: judge (claude-haiku-4-5) │
╰─────────────────────────────────────────────────────────────╯
Case 1/3 ✓ score=0.94 (accuracy=1.0 helpfulness=0.9 safety=1.0 groundedness=0.85)
Case 2/3 ✓ score=0.91 (accuracy=0.9 helpfulness=0.95 safety=1.0 groundedness=0.8)
Case 3/3 ✓ score=0.88 (accuracy=0.85 helpfulness=0.9 safety=1.0 groundedness=0.78)
┌──────────────┬───────┐
│ Metric │ Score │
├──────────────┼───────┤
│ accuracy │ 0.917 │
│ helpfulness │ 0.917 │
│ safety │ 1.000 │
│ groundedness │ 0.810 │
│ overall │ 0.911 │
└──────────────┴───────┘
✅ Run complete. Overall: 0.911

3 — View results

agentbreeder eval results run-a1b2c3d4

Run ID: run-a1b2c3d4
Agent: support-agent
Dataset: support-qa-v2
Model: claude-sonnet-4 (judge: claude-haiku-4-5)
Scorer: judge
Cases: 3 / 3 passed
Overall: 0.911
Per-case breakdown:
#1 input="How do I reset my password?"
expected="Visit /account/reset..."
actual="To reset your password, go to /account/reset and enter your email address."
score=0.94
#2 input="What's your refund policy?"
...

4 — Gate a deployment
Block a deploy if the eval score drops below a threshold:
# In CI, after running the eval:
agentbreeder eval gate run-a1b2c3d4 --threshold 0.85
# Exits 0 if score ≥ 0.85, exits 1 otherwise

Sample CI output (pass):
Gate check: run-a1b2c3d4
Threshold: 0.85
Score: 0.911
✅ Gate passed — deploying.

Sample CI output (fail):
Gate check: run-b9z8y7x6
Threshold: 0.85
Score: 0.731
❌ Gate failed — score 0.731 < threshold 0.85. Deploy blocked.

Comparing Two Runs
Catch regressions when you change a model or prompt:
agentbreeder eval compare run-a1b2c3d4 run-b9z8y7x6

Example output:
Comparing runs:
A: run-a1b2c3d4 (support-agent, claude-sonnet-4, score=0.911)
B: run-b9z8y7x6 (support-agent, claude-haiku-4-5, score=0.731)
┌──────────────┬───────┬───────┬───────────┐
│ Metric │ Run A │ Run B │ Δ │
├──────────────┼───────┼───────┼───────────┤
│ accuracy │ 0.917 │ 0.700 │ ▼ −0.217 │
│ helpfulness │ 0.917 │ 0.733 │ ▼ −0.184 │
│ safety │ 1.000 │ 1.000 │ 0.000 │
│ groundedness │ 0.810 │ 0.590 │ ▼ −0.220 │
│ overall │ 0.911 │ 0.731 │ ▼ −0.180 │
└──────────────┴───────┴───────┴───────────┘
⚠️ Regression detected (threshold: 0.05). Run B is significantly worse on:
accuracy (−0.217), groundedness (−0.220), helpfulness (−0.184)

Integrating Evals into CI
Add an eval gate to your GitHub Actions workflow:
# .github/workflows/eval-gate.yml
name: Eval Gate
on:
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install AgentBreeder
        run: pip install agentbreeder

      - name: Run eval
        run: |
          agentbreeder eval run ${{ env.AGENT_NAME }} \
            --dataset ${{ env.DATASET_ID }} \
            --scorer judge \
            --judge-model claude-haiku-4-5 \
            --json > eval-result.json
        env:
          AGENTBREEDER_API_URL: ${{ secrets.AGENTBREEDER_API_URL }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Gate on score
        run: |
          RUN_ID=$(jq -r '.run_id' eval-result.json)
          agentbreeder eval gate "$RUN_ID" --threshold 0.85

      - name: Upload eval result
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-result
          path: eval-result.json

The eval-on-pr.yml workflow in the AgentBreeder repo itself uses this pattern to gate agent changes on score ≥ 0.80.
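The same gate can be scripted outside GitHub Actions. Below is a minimal sketch, assuming the agentbreeder CLI is on PATH and that the --json output includes the run_id field the jq step above relies on; the agent and dataset names are the quickstart examples.

# CI-agnostic gate script. Assumes the `agentbreeder` CLI is installed and on
# PATH, and that `--json` output contains a `run_id` field (as the jq step
# above relies on). Agent and dataset names are the quickstart examples.
import json
import subprocess
import sys

result = subprocess.run(
    ["agentbreeder", "eval", "run", "support-agent",
     "--dataset", "support-qa-v2",
     "--scorer", "judge",
     "--judge-model", "claude-haiku-4-5",
     "--json"],
    capture_output=True, text=True, check=True,
)
run_id = json.loads(result.stdout)["run_id"]

# `eval gate` exits 0 on pass and 1 on fail, so propagate its exit code to CI.
gate = subprocess.run(["agentbreeder", "eval", "gate", run_id, "--threshold", "0.85"])
sys.exit(gate.returncode)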
CLI Reference
agentbreeder eval run AGENT_NAME --dataset DATASET_ID [OPTIONS]
Options:
  --dataset, -d TEXT           Dataset ID to run against  [required]
  --model, -m TEXT             Model override
  --temperature, -T FLOAT      Temperature override
  --scorer TEXT                exact | semantic | judge  [default: exact]
  --judge, --judge-model TEXT  LLM to use for judge scoring
  --json                       Output as JSON
agentbreeder eval datasets [--team TEAM] [--json]
agentbreeder eval results RUN_ID [--json]
agentbreeder eval compare RUN_A RUN_B [--regression-threshold 0.05] [--json]
agentbreeder eval gate RUN_ID [--threshold 0.7] [--metrics METRICS] [--json]

See agentbreeder eval --help for the full reference.
Dashboard
All eval runs are visible in the Evaluations tab of the AgentBreeder dashboard:
- Run history per agent
- Per-case diffs between runs
- Score trend charts over time
- One-click re-run with the same or a different scorer
Creating Datasets via API
curl -X POST https://your-agentbreeder/api/v1/evals/datasets \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "support-qa-v2",
"team": "customer-success",
"cases": [
{"input": "How do I reset my password?", "expected": "Visit /account/reset..."},
{"input": "What is your refund policy?", "expected": "30-day refunds on all plans."}
]
}'
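The same request can be made from a script. Here is a minimal Python sketch of the curl call above, assuming the requests package is installed and your API token is in the TOKEN environment variable; replace the host with your own AgentBreeder deployment.

# Python equivalent of the curl call above. Assumes the `requests` package and
# an API token in the TOKEN environment variable; replace the host with your
# own AgentBreeder deployment.
import os
import requests

resp = requests.post(
    "https://your-agentbreeder/api/v1/evals/datasets",
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    json={
        "name": "support-qa-v2",
        "team": "customer-success",
        "cases": [
            {"input": "How do I reset my password?", "expected": "Visit /account/reset..."},
            {"input": "What is your refund policy?", "expected": "30-day refunds on all plans."},
        ],
    },
)
resp.raise_for_status()
print(resp.json())  # created dataset record, assuming the API returns JSON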