AFK’s eval framework tests agent behavior with real LLM calls. Define test cases, run them against your agents, and gate releases on pass rate. Think of evals as integration tests for AI — they verify that prompts, tools, and orchestration produce correct behavior.

Your first eval

from afk.agents import Agent
from afk.core import Runner
from afk.evals import run_suite
from afk.evals.models import EvalCase

agent = Agent(
    name="classifier",
    model="gpt-5.2-mini",
    instructions="Classify as: billing, technical, account, other. Output only the label.",
)

suite = run_suite(
    runner_factory=lambda: Runner(),
    cases=[
        EvalCase(
            name="billing-question",
            agent=agent,
            user_message="Why was I charged twice?",
            assertions=[lambda r: r.final_text.strip().lower() == "billing"],
        ),
        EvalCase(
            name="technical-question",
            agent=agent,
            user_message="The API returns a 500 error",
            assertions=[lambda r: r.final_text.strip().lower() == "technical"],
        ),
    ],
)

print(f"Passed: {suite.passed}/{suite.total}")
assert suite.failed == 0

Eval lifecycle

1. Define cases: each case specifies an agent, an input message, and assertions to verify.
2. Schedule execution: the scheduler runs cases sequentially or in parallel, respecting concurrency limits.
3. Run agents: each case runs through a full agent loop with real LLM calls.
4. Assert results: assertions verify the result, including text content, state, tool usage, cost, and latency.
5. Check budgets: budget limits gate individual case costs and the total suite cost.
6. Generate report: pass/fail results, assertion details, and metrics are collected into a report (see the sketch below).
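
As a rough sketch of steps 4 through 6, the suite object returned by run_suite can be inspected after a run. The suite-level counters (passed, total, failed, pass_rate) appear in the examples on this page; the per-case results collection and its fields are assumptions about the report shape, not confirmed API.
# Inspect the report after a run (minimal sketch).
suite = run_suite(runner_factory=lambda: Runner(), cases=cases)

print(f"Passed: {suite.passed}/{suite.total} ({suite.pass_rate:.0%})")

# Assumed: a per-case results collection with name/passed fields.
for result in suite.results:
    status = "PASS" if result.passed else "FAIL"
    print(f"[{status}] {result.name}")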

Eval case types

The most basic case type verifies correct behavior under normal conditions:
EvalCase(
    name="basic-greeting",
    agent=agent,
    user_message="Hello!",
    assertions=[
        lambda r: r.state == "completed",
        lambda r: len(r.final_text) > 0,
    ],
)

Assertions

Assertions are functions that take an AgentResult and return True (pass) or False (fail):
# Text content assertions
lambda r: "error budget" in r.final_text.lower()
lambda r: len(r.final_text) < 500

# State assertions
lambda r: r.state == "completed"
lambda r: r.state != "failed"

# Tool usage assertions
lambda r: len(r.tool_executions) > 0
lambda r: all(t.success for t in r.tool_executions)

# Cost assertions
lambda r: r.usage.total_cost_usd < 0.10
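
Assertions don't have to be lambdas. For checks that are more involved, a named function that takes the result and returns a bool works the same way and keeps the case definition readable:
# A named assertion function behaves like any other assertion callable.
def answered_with_billing_label(r):
    # Pass only if the agent output exactly the "billing" label.
    return r.final_text.strip().lower() == "billing"

EvalCase(
    name="billing-question",
    agent=agent,
    user_message="Why was I charged twice?",
    assertions=[answered_with_billing_label],
)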

Suite configuration

from afk.evals.models import EvalBudget, EvalSuiteConfig

config = EvalSuiteConfig(
    max_concurrency=3,         # Run up to 3 cases in parallel
    fail_fast=True,            # Stop on first failure
    timeout_per_case_s=60.0,   # Max time per case
    total_budget=EvalBudget(
        max_total_cost_usd=1.00,  # Total suite budget
    ),
)

suite = run_suite(
    runner_factory=lambda: Runner(),
    cases=cases,
    config=config,
)

CI integration

Run evals in your CI pipeline to gate releases:
# .github/workflows/evals.yml
name: Agent Evals
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install afk pytest
      - run: python -m pytest tests/evals/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Set a budget for CI evals. Without a budget, a broken prompt can drain your API credits during a CI run. Use EvalBudget(max_total_cost_usd=2.00) as a reasonable CI limit.
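
The workflow above expects an eval module under tests/evals/. A minimal sketch of such a file, wired to the budget recommendation, could look like the following; the file path and the module exporting the cases are illustrative, not part of AFK:
# tests/evals/test_agent_evals.py (illustrative path)
from afk.core import Runner
from afk.evals import run_suite
from afk.evals.models import EvalBudget, EvalSuiteConfig

# Hypothetical module that exports a list of EvalCase objects for your agents.
from my_project.eval_cases import classifier_cases


def test_classifier_eval_suite():
    config = EvalSuiteConfig(
        total_budget=EvalBudget(max_total_cost_usd=2.00),  # CI budget cap from the tip above
    )
    suite = run_suite(
        runner_factory=lambda: Runner(),
        cases=classifier_cases,
        config=config,
    )
    assert suite.failed == 0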

Release gating

Gate releases on eval pass rate:
import sys

suite = run_suite(runner_factory=lambda: Runner(), cases=cases)

if suite.pass_rate < 0.95:
    print(f"Release blocked: {suite.pass_rate:.0%} pass rate (need 95%)")
    sys.exit(1)

print(f"Release approved: {suite.pass_rate:.0%} pass rate")

Next steps