AFK evals run agents against named inputs and check the resulting state, text, tool usage, budgets, and telemetry. Use them for prompt changes, tool changes, routing changes, and regression tests. They can run against real providers, test adapters, or agents configured with deterministic tools.

Your first eval

from afk.agents import Agent
from afk.core import Runner
from afk.evals import EvalCase, EvalSuiteConfig, StateCompletedAssertion, run_suite

agent = Agent(
    name="classifier",
    model="gpt-4.1-mini",
    instructions="Classify as: billing, technical, account, other. Output only the label.",
)

suite = run_suite(
    runner_factory=lambda: Runner(),
    cases=[
        EvalCase(
            name="billing-question",
            agent=agent,
            user_message="Why was I charged twice?",
        ),
        EvalCase(
            name="technical-question",
            agent=agent,
            user_message="The API returns a 500 error",
        ),
    ],
    config=EvalSuiteConfig(
        assertions=(StateCompletedAssertion(),),
    ),
)

print(f"Passed: {suite.passed}/{suite.total}")
assert suite.failed == 0

Eval lifecycle

1. Define cases. Each case specifies an agent and input message. Suite-level assertions and budgets verify the result.
2. Schedule execution. The scheduler runs cases sequentially or in parallel, respecting concurrency limits.
3. Run agents. Each case runs through the same runner path your application uses.
4. Assert results. Assertions verify the result: text content, state, tool usage, cost, latency, etc.
5. Check budgets. Budget limits gate individual case costs and the total suite cost.
6. Generate report. Pass/fail results, assertion details, and metrics are collected into a report.
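The lifecycle above can be sketched with a toy harness that uses only the standard library. This is an illustration of the flow, not AFK's implementation: `ToyCase`, `ToyResult`, `fake_agent`, and `run_toy_suite` are hypothetical names, the agent is a deterministic stand-in for a model call, and budget checks are left out to keep the sketch short.

```python
from dataclasses import dataclass, field

# Every name in this block is a hypothetical stand-in for illustration;
# none of them is part of afk.evals.


@dataclass
class ToyCase:
    name: str
    user_message: str


@dataclass
class ToyResult:
    name: str
    final_text: str
    failures: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


def fake_agent(message: str) -> str:
    """Deterministic stand-in for a model call."""
    return "billing" if "charged" in message else "technical"


def run_toy_suite(cases, agent, assertions):
    results = []
    for case in cases:                    # scheduling: sequential here
        text = agent(case.user_message)   # run the agent for this case
        result = ToyResult(case.name, text)
        for label, check in assertions:   # apply every suite-level assertion
            if not check(text):
                result.failures.append(label)
        results.append(result)
    return results                        # raw material for the report


cases = [
    ToyCase("billing-question", "Why was I charged twice?"),
    ToyCase("technical-question", "The API returns a 500 error"),
]
assertions = [
    ("non-empty", lambda t: bool(t.strip())),
    ("known-label", lambda t: t in {"billing", "technical", "account", "other"}),
]

results = run_toy_suite(cases, fake_agent, assertions)
passed = sum(r.passed for r in results)
print(f"Passed: {passed}/{len(results)}")
```

Keeping assertions at suite level, as in this sketch, means every case is held to the same checks, which is why the cases themselves only carry a name and an input.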

Eval case types

A case like the following verifies correct behavior under normal conditions:
EvalCase(
    name="basic-greeting",
    agent=agent,
    user_message="Hello!",
)

Assertions

Assertions are suite-level callables. Import the built-ins from afk.evals, or implement the EvalAssertion protocol.
from afk.evals import FinalTextContainsAssertion, StateCompletedAssertion

assertions = (
    StateCompletedAssertion(),
    FinalTextContainsAssertion("error budget"),
)
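To make the idea of a callable assertion concrete, here is a minimal standalone sketch. The exact signature of the `EvalAssertion` protocol is not shown on this page, so the shapes below are assumptions: `ToyOutcome` and `FinalTextIsOneOf` are hypothetical, and the real protocol may receive a richer result object and return a different type. Check the protocol definition in `afk.evals` before adapting this.

```python
from dataclasses import dataclass

# Hypothetical shapes for illustration only. The real EvalAssertion
# protocol may take different arguments and return a different type.


@dataclass(frozen=True)
class ToyOutcome:
    passed: bool
    detail: str


@dataclass(frozen=True)
class FinalTextIsOneOf:
    """Passes when the stripped, lowercased final text is an allowed label."""

    allowed: frozenset

    def __call__(self, final_text: str) -> ToyOutcome:
        label = final_text.strip().lower()
        if label in self.allowed:
            return ToyOutcome(True, f"label {label!r} is allowed")
        return ToyOutcome(False, f"unexpected label {label!r}")


check = FinalTextIsOneOf(frozenset({"billing", "technical", "account", "other"}))
ok = check("Billing\n")
bad = check("I think this is about billing.")
print(ok.passed, bad.passed)
```

Returning a detail string alongside the pass/fail flag keeps failure reports readable when a suite has many cases.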

Suite configuration

from afk.core import Runner
from afk.evals import EvalBudget, EvalSuiteConfig, run_suite

config = EvalSuiteConfig(
    max_concurrency=3,         # Run up to 3 cases in parallel
    fail_fast=True,            # Stop on first failure
    budget=EvalBudget(
        max_total_cost_usd=1.00,  # Total suite budget
    ),
)

# `cases` is a list of EvalCase objects, defined as in the first example.
suite = run_suite(
    runner_factory=lambda: Runner(),
    cases=cases,
    config=config,
)
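A setting like `max_concurrency=3` caps how many cases are in flight at once. One common way to enforce such a cap is a semaphore; the sketch below shows that technique with `asyncio` and a sleep standing in for an agent run. It is an assumption about the general approach, not a description of AFK's scheduler, and `run_bounded` is a hypothetical helper.

```python
import asyncio


async def run_bounded(case_names, max_concurrency):
    """Run stand-in cases with at most `max_concurrency` in flight."""
    sem = asyncio.Semaphore(max_concurrency)
    active = 0      # cases currently running
    peak = 0        # highest value `active` ever reached
    finished = []

    async def run_one(name):
        nonlocal active, peak
        async with sem:                # blocks while the cap is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for an agent run
            active -= 1
            finished.append(name)

    await asyncio.gather(*(run_one(n) for n in case_names))
    return finished, peak


finished, peak = asyncio.run(
    run_bounded([f"case-{i}" for i in range(6)], max_concurrency=3)
)
print(f"finished={len(finished)} peak_in_flight={peak}")
```

A lower cap trades wall-clock time for gentler pressure on provider rate limits, which is usually the deciding factor when choosing the value.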

CI integration

Run evals in your CI pipeline to gate releases:
# .github/workflows/evals.yml
name: Agent Evals
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m pip install afk-py pytest
      - run: python -m pytest tests/evals/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Set a budget for CI evals. Without a budget, a broken prompt can drain your API credits during a CI run. Use EvalBudget(max_total_cost_usd=2.00) as a reasonable CI limit.
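The effect of a total-cost cap can be sketched in a few lines. This is a toy model under stated assumptions: per-case costs are hard-coded, `run_with_budget` is a hypothetical helper, and the rule "stop as soon as cumulative spend exceeds the cap" is one plausible policy. How AFK enforces `max_total_cost_usd` may differ.

```python
def run_with_budget(case_costs_usd, max_total_cost_usd):
    """Run cases in order and stop once cumulative spend exceeds the cap."""
    spent = 0.0
    completed = []
    for name, cost in case_costs_usd:
        spent += cost              # a case's cost is only known after it runs
        completed.append(name)
        if spent > max_total_cost_usd:
            return completed, spent, True   # budget exceeded, stop early
    return completed, spent, False


# Hypothetical per-case costs; a runaway prompt would make these much larger.
costs = [("case-1", 0.75), ("case-2", 0.75), ("case-3", 0.75), ("case-4", 0.75)]
completed, spent, exceeded = run_with_budget(costs, max_total_cost_usd=2.00)
print(completed, spent, exceeded)
```

In this model the third case pushes spend past the cap, so the fourth never starts. The overshoot is bounded by the cost of a single case, which is why a per-case limit is a useful complement to a suite-level one.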

Release gating

Gate releases on eval pass rate:
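import sys

from afk.core import Runner
from afk.evals import run_suite

# `cases` is the list of EvalCase objects that make up the release suite.
suite = run_suite(runner_factory=lambda: Runner(), cases=cases)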
pass_rate = suite.passed / suite.total if suite.total else 0.0

if pass_rate < 0.95:
    print(f"Release blocked: {pass_rate:.0%} pass rate (need 95%)")
    sys.exit(1)

print(f"Release approved: {pass_rate:.0%} pass rate")
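The gating rule itself can be factored into a small pure function, which makes the threshold logic testable without running any agents. `release_gate` is a hypothetical helper, not part of AFK; it takes the same `passed` and `total` counts the suite result exposes above.

```python
def release_gate(passed: int, total: int, threshold: float = 0.95) -> int:
    """Return a process exit code: 0 to approve the release, 1 to block it."""
    # An empty suite counts as a 0% pass rate, so it blocks the release.
    pass_rate = passed / total if total else 0.0
    return 0 if pass_rate >= threshold else 1


print(release_gate(19, 20))  # 95% meets the threshold
print(release_gate(18, 20))  # 90% falls short
print(release_gate(0, 0))    # empty suite
```

Treating an empty suite as a failure guards against a misconfigured test path silently approving a release.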

Next steps

Security Model

Security boundaries and production hardening.

Building with AI

Production playbook with anti-patterns.