AFK evals run agents against named inputs and check the resulting state, text, tool usage, budgets, and telemetry. Use them for prompt changes, tool changes, routing changes, and regression tests. They can run against real providers, test adapters, or agents configured with deterministic tools.
Your first eval
from afk.agents import Agent
from afk.core import Runner
from afk.evals import EvalCase, EvalSuiteConfig, StateCompletedAssertion, run_suite

agent = Agent(
    name="classifier",
    model="gpt-4.1-mini",
    instructions="Classify as: billing, technical, account, other. Output only the label.",
)

suite = run_suite(
    runner_factory=lambda: Runner(),
    cases=[
        EvalCase(
            name="billing-question",
            agent=agent,
            user_message="Why was I charged twice?",
        ),
        EvalCase(
            name="technical-question",
            agent=agent,
            user_message="The API returns a 500 error",
        ),
    ],
    config=EvalSuiteConfig(
        assertions=(StateCompletedAssertion(),),
    ),
)

print(f"Passed: {suite.passed}/{suite.total}")
assert suite.failed == 0
Eval lifecycle
Define cases
Each case specifies an agent and input message. Suite-level assertions and budgets verify the result.
Schedule execution
The scheduler runs cases sequentially or in parallel, respecting concurrency
limits.
Run agents
Each case runs through the same runner path your application uses.
Assert results
Assertions verify the result: text content, state, tool usage, cost,
latency, and so on.
Check budgets
Budget limits gate individual case costs and the total suite cost.
Generate report
Pass/fail results, assertion details, and metrics are collected into a
report.
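The lifecycle above can be sketched in a few lines of plain Python. This is only an illustration of the flow (define cases, schedule with a concurrency limit, run, assert, report) and not AFK's implementation: `SketchCase`, `SketchResult`, `fake_agent`, and `run_cases` are hypothetical names, and the "agent" is a deterministic stand-in rather than a model call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SketchCase:
    # Hypothetical stand-in for an eval case: a name plus an input message.
    name: str
    user_message: str


@dataclass
class SketchResult:
    # Hypothetical per-case record collected into the report.
    name: str
    final_text: str
    passed: bool


async def fake_agent(message: str) -> str:
    # Deterministic stand-in for an agent run (no provider call).
    await asyncio.sleep(0)
    return "billing" if "charged" in message.lower() else "technical"


async def run_cases(cases, expected, max_concurrency=2):
    # A semaphore caps how many cases are in flight at once.
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(case):
        async with gate:
            text = await fake_agent(case.user_message)
        # Assertion step: compare the final text with the expected label.
        return SketchResult(case.name, text, text == expected[case.name])

    # gather preserves input order, so results line up with the cases.
    return await asyncio.gather(*(run_one(c) for c in cases))


cases = [
    SketchCase("billing-question", "Why was I charged twice?"),
    SketchCase("technical-question", "The API returns a 500 error"),
]
expected = {"billing-question": "billing", "technical-question": "technical"}

results = asyncio.run(run_cases(cases, expected))
report = {"passed": sum(r.passed for r in results), "total": len(results)}
print(report)
```

With `max_concurrency=1` the same code runs the cases sequentially, which is one way to picture the difference between the two scheduling modes.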
Eval case types
Happy path
Verify correct behavior under normal conditions.
EvalCase(
    name="basic-greeting",
    agent=agent,
    user_message="Hello!",
)
Failure path
Verify graceful handling of errors and edge cases.
EvalCase(
    name="invalid-input",
    agent=agent,
    user_message="",
)
Tool usage
Verify that the agent uses tools correctly.
EvalCase(
    name="uses-search",
    agent=agent,
    user_message="Find docs about caching",
)
Budget-constrained
Verify that the agent stays within cost limits.
EvalCase(
    name="within-budget",
    agent=agent,
    user_message="Analyze this dataset",
    budget=EvalBudget(max_total_cost_usd=0.05, max_total_tokens=4_000),
)
Assertions
Assertions are suite-level callables. Import the built-ins from afk.evals, or implement the EvalAssertion protocol.
from afk.evals import FinalTextContainsAssertion, StateCompletedAssertion

assertions = (
    StateCompletedAssertion(),
    FinalTextContainsAssertion("error budget"),
)
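A custom check might look like the following sketch. This page does not show the `EvalAssertion` protocol's signature, so the shapes here are assumptions: `SketchResult`, `SketchAssertion`, `TextOneOf`, and `check_all` are hypothetical stand-ins, and the real protocol may take different arguments or return a richer object than a boolean. Check the `afk.evals` reference before adapting it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class SketchResult:
    # Hypothetical result shape; the real AFK result object may differ.
    state: str
    final_text: str


class SketchAssertion(Protocol):
    # Assumed protocol: a callable that returns True when the check passes.
    def __call__(self, result: SketchResult) -> bool: ...


@dataclass
class TextOneOf:
    # Passes when the final text is exactly one of the allowed labels.
    allowed: tuple

    def __call__(self, result: SketchResult) -> bool:
        # Normalize whitespace and case before comparing labels.
        return result.final_text.strip().lower() in self.allowed


def check_all(result, assertions):
    # Return the class names of the assertions that failed.
    return [type(a).__name__ for a in assertions if not a(result)]


labels = TextOneOf(("billing", "technical", "account", "other"))
ok = SketchResult(state="completed", final_text=" Billing\n")
bad = SketchResult(state="completed", final_text="I think this is billing.")
print(check_all(ok, (labels,)), check_all(bad, (labels,)))
```

A strict label check like this suits the classifier from the first example, where the instructions ask for the label only and a substring match would accept a chatty answer.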
Suite configuration
from afk.core import Runner
from afk.evals import EvalBudget, EvalSuiteConfig, run_suite

config = EvalSuiteConfig(
    max_concurrency=3,  # Run up to 3 cases in parallel
    fail_fast=True,  # Stop on first failure
    budget=EvalBudget(
        max_total_cost_usd=1.00,  # Total suite budget
    ),
)

# `cases` is a list of EvalCase objects, as in the first example.
suite = run_suite(
    runner_factory=lambda: Runner(),
    cases=cases,
    config=config,
)
CI integration
Run evals in your CI pipeline to gate releases:
# .github/workflows/evals.yml
name: Agent Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m pip install afk-py pytest
      - run: python -m pytest tests/evals/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Set a budget for CI evals. Without a budget, a broken prompt can drain
your API credits during a CI run. Use EvalBudget(max_total_cost_usd=2.00) as
a reasonable CI limit.
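To see why a cap bounds the worst case, consider this minimal sketch. It does not show how AFK enforces `EvalBudget`, which this page does not describe; `run_with_cap` is a hypothetical helper and the per-case costs are made-up numbers. The assumption is that a case's cost is only known after it runs, so the cap is checked after each case and stops further scheduling.

```python
def run_with_cap(case_costs, max_total_cost_usd):
    # case_costs: list of (case_name, cost_usd) pairs in scheduling order.
    spent = 0.0
    completed = []
    for name, cost in case_costs:
        # Cost is only known once the case has run, so add it first.
        spent += cost
        completed.append(name)
        if spent > max_total_cost_usd:
            # Cap exceeded: stop scheduling the remaining cases.
            break
    return completed, spent


# Made-up costs for a prompt that has started producing very long outputs.
runaway = [("case-%d" % i, 0.75) for i in range(10)]
completed, spent = run_with_cap(runaway, max_total_cost_usd=2.00)
print(completed, spent)
```

Under these assumptions the run stops after the third case at $2.25 instead of spending $7.50 across all ten, so the overshoot is at most one case's cost.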
Release gating
Gate releases on eval pass rate:
import sys

suite = run_suite(runner_factory=lambda: Runner(), cases=cases)

# Treat an empty suite as a 0% pass rate so it cannot approve a release.
pass_rate = suite.passed / suite.total if suite.total else 0.0

if pass_rate < 0.95:
    print(f"Release blocked: {pass_rate:.0%} pass rate (need 95%)")
    sys.exit(1)

print(f"Release approved: {pass_rate:.0%} pass rate")
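The gating rule can also be factored into a small pure function so the threshold logic is unit-testable without running any agents. This is a sketch: `release_gate` is a hypothetical helper, not part of AFK, and it takes the `passed` and `total` counts as plain integers.

```python
def release_gate(passed: int, total: int, threshold: float = 0.95) -> int:
    # Return a process exit code: 0 approves the release, 1 blocks it.
    # An empty suite counts as a 0% pass rate and therefore blocks.
    pass_rate = passed / total if total else 0.0
    return 0 if pass_rate >= threshold else 1


print(release_gate(19, 20), release_gate(18, 20), release_gate(0, 0))
```

A release script could then call `sys.exit(release_gate(suite.passed, suite.total))`, assuming the suite exposes those counts as in the example above.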
Next steps
Security Model: Security boundaries and production hardening.
Building with AI: Production playbook with anti-patterns.