Guaranteed behaviors
Deterministic delegation ordering
What is tested: When a parent agent dispatches work to subagents, the delegation engine executes nodes in a deterministic, topologically sorted order. Parallel batches respect concurrency limits. Edge dependencies (where one subagent’s output feeds into another’s input) are resolved before the dependent node starts.
Why it matters: If delegation ordering were non-deterministic, the same agent configuration could produce different results depending on task scheduling. Deterministic ordering means your subagent pipelines are reproducible and debuggable.
Test coverage: Tests verify that DAG plans produce consistent node execution sequences, that edge-based data flow resolves correctly, and that backpressure limits prevent unbounded queue growth.
Queue contract failure classification
What is tested: The task queue system classifies failures into retryable, terminal, and degraded categories. Retryable failures trigger retry with backoff. Terminal failures stop execution immediately. Degraded states allow partial results.
Why it matters: Incorrect failure classification can cause infinite retry loops (if terminal failures are marked retryable) or premature task abandonment (if retryable failures are marked terminal). The failure classification contract ensures that each failure type triggers the correct recovery behavior.
Test coverage: Tests verify that each failure category maps to the correct queue behavior, that retry counts and backoff intervals are respected, and that dead-letter handling works correctly.
Stream lifecycle correctness
What is tested: The streaming API (runner.run_stream()) produces events in the correct lifecycle order: stream starts, text deltas arrive, tool events fire at the right times, and the stream terminates with a completed event containing the final AgentResult. Error conditions produce error stream events rather than unhandled exceptions.
Why it matters: Streaming consumers (such as chat UIs) depend on events arriving in the correct order. A misplaced completed event before all text deltas have been emitted would cause truncated output. An unhandled exception would crash the consumer.
Test coverage: Tests verify event ordering, ensure that all text content is captured before the terminal event, and confirm that error conditions produce structured error events.
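A consumer-side sketch of this ordering contract follows. The event names (`started`, `text_delta`, `completed`, `error`) and payload shapes are assumptions for illustration, not afk's actual stream event types:

```python
# Hypothetical lifecycle validator a streaming consumer (or test) might
# run over incoming events; event names here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class LifecycleValidator:
    """Checks that stream events arrive in a legal lifecycle order."""
    started: bool = False
    finished: bool = False
    text: list = field(default_factory=list)

    def feed(self, event: str, payload: str = "") -> None:
        if self.finished:
            raise ValueError(f"event {event!r} after the terminal event")
        if event == "started":
            if self.started:
                raise ValueError("duplicate started event")
            self.started = True
        elif not self.started:
            raise ValueError(f"event {event!r} before started")
        elif event == "text_delta":
            self.text.append(payload)
        elif event in ("completed", "error"):
            self.finished = True  # exactly one terminal event allowed

validator = LifecycleValidator()
for ev, data in [("started", ""), ("text_delta", "Hel"),
                 ("text_delta", "lo"), ("completed", "")]:
    validator.feed(ev, data)
print("".join(validator.text))  # all text deltas captured before the terminal event
```

A chat UI built on this pattern can render deltas as they arrive and treat the terminal event as the only signal that output is complete.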
Telemetry projection stability
What is tested: The telemetry projector produces consistent RunMetrics from the same input data. Field names, types, and computed properties (like avg_llm_latency_ms and success) are stable across versions.
Why it matters: Downstream dashboards, alerting rules, and eval assertions depend on RunMetrics having a stable schema. If a field is renamed or its type changes, every consumer breaks silently.
Test coverage: Tests verify that projected metrics match expected values for known input data, that computed properties produce correct results, and that to_dict() serialization is stable.
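The shape of this contract can be sketched as follows. The names RunMetrics, avg_llm_latency_ms, and success come from the text above; the concrete fields and arithmetic are assumptions about what such a projection might compute:

```python
# Sketch of a stable metrics projection; field layout is hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class RunMetrics:
    llm_calls: int
    total_llm_latency_ms: float
    errors: int

    @property
    def avg_llm_latency_ms(self) -> float:
        # Guard against division by zero for runs with no LLM calls.
        return self.total_llm_latency_ms / self.llm_calls if self.llm_calls else 0.0

    @property
    def success(self) -> bool:
        return self.errors == 0

    def to_dict(self) -> dict:
        # Serialization is part of the contract: renaming any key here
        # would silently break dashboards, alerts, and eval assertions.
        return {
            "llm_calls": self.llm_calls,
            "total_llm_latency_ms": self.total_llm_latency_ms,
            "errors": self.errors,
            "avg_llm_latency_ms": self.avg_llm_latency_ms,
            "success": self.success,
        }

m = RunMetrics(llm_calls=4, total_llm_latency_ms=1000.0, errors=0)
assert m.avg_llm_latency_ms == 250.0 and m.success
```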
Eval report schema consistency
What is tested: The eval report serializer (suite_report_payload) produces output with a stable schema_version and consistent field structure. The report includes summary statistics, per-case results, assertion details, budget violations, and projected metrics.
Why it matters: CI pipelines parse eval reports to make release-gating decisions. If the report schema changes, CI scripts break and deployments may be incorrectly blocked or allowed.
Test coverage: Tests verify the report envelope structure, confirm that schema_version is set correctly, and validate that all expected fields are present and correctly typed.
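A CI-side schema check for such a report might look like the sketch below. The suite_report_payload name and schema_version field come from the text; the exact envelope fields are assumptions:

```python
# Hypothetical envelope validation for an eval report; the field set
# below is illustrative, not afk's actual report schema.
EXPECTED_TOP_LEVEL = {"schema_version", "summary", "cases",
                      "budget_violations", "metrics"}

def validate_report(report: dict) -> None:
    """Fail loudly if the envelope drifts from the pinned schema."""
    missing = EXPECTED_TOP_LEVEL - report.keys()
    if missing:
        raise ValueError(f"report missing fields: {sorted(missing)}")
    if not isinstance(report["schema_version"], int):
        raise TypeError("schema_version must be an int")

report = {
    "schema_version": 1,
    "summary": {"passed": 9, "failed": 1},
    "cases": [],
    "budget_violations": [],
    "metrics": {},
}
validate_report(report)  # raises nothing: envelope matches the pinned schema
```

Pinning the expected field set in the gating script, rather than iterating over whatever fields happen to be present, is what turns silent schema drift into a loud CI failure.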
Test categories
| Category | Description | Key Modules Covered |
|---|---|---|
| Agent delegation | Deterministic DAG execution, subagent routing, backpressure limits. | afk.core.runtime, afk.agents.delegation |
| Queue contracts | Failure classification, retry behavior, dead-letter handling. | afk.queues |
| Stream lifecycle | Event ordering, text delta capture, terminal event correctness. | afk.core.streaming |
| Telemetry projection | Metric stability, computed property correctness, serialization. | afk.observability.projectors, afk.observability.models |
| Eval reports | Report schema stability, assertion result structure, budget violations. | afk.evals.reporting, afk.evals.models |
| Tool execution | Pydantic validation, timeout enforcement, hook/middleware chains. | afk.tools |
| Policy evaluation | Rule matching, decision actions, audit event emission. | afk.agents.policy |
| LLM runtime | Provider routing, retry/circuit-breaker behavior, streaming correctness. | afk.llms.runtime |
| Security | Sandbox profile enforcement, secret scope isolation, command allowlists. | afk.tools.security |
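As a concrete illustration of the first row (the deterministic delegation contract described above), batched topological ordering can be sketched with the standard library's `graphlib`. The `plan_batches` helper and node names are hypothetical, not afk's actual scheduler:

```python
# Nodes run in topologically sorted batches, each batch capped by a
# concurrency limit, with ties broken by name so ordering is reproducible.
from graphlib import TopologicalSorter

def plan_batches(deps: dict, max_concurrency: int) -> list:
    ts = TopologicalSorter(deps)  # deps maps node -> set of predecessors
    ts.prepare()
    batches = []
    while ts.is_active():
        ready = sorted(ts.get_ready())            # deterministic tie-break
        for i in range(0, len(ready), max_concurrency):
            batch = ready[i:i + max_concurrency]  # respect the concurrency cap
            batches.append(batch)
            ts.done(*batch)
    return batches

# "summarize" consumes the outputs of both fetch nodes (an edge dependency),
# so it cannot start until both have completed.
deps = {"fetch_a": set(), "fetch_b": set(), "summarize": {"fetch_a", "fetch_b"}}
print(plan_batches(deps, max_concurrency=2))
# → [['fetch_a', 'fetch_b'], ['summarize']]
```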
Running the test suite
Run all tests from the repository root:
Interpreting test results
- All tests pass: The framework’s behavioral contracts are intact. Safe to upgrade or deploy.
- A test fails: A contract has been broken. The failure message identifies which guarantee was violated. Do not deploy until the contract is restored.
- A new test is marked as xfail: The team has acknowledged a known limitation and documented the expected behavior.
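The xfail marker above is pytest's. A runnable stand-in using the standard library's `unittest.expectedFailure` (with a hypothetical test name and limitation) shows the same idea:

```python
# An acknowledged limitation is recorded as an expected failure, so the
# suite stays green while the gap remains documented in code.
import unittest

class TestDegradedQueue(unittest.TestCase):
    @unittest.expectedFailure
    def test_degraded_state_surfaces_partial_results(self):
        # Known limitation, documented rather than hidden.
        self.fail("partial results are not surfaced yet")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDegradedQueue)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())  # True: the expected failure does not fail the run
```

If the limitation is later fixed, the test unexpectedly passes and is flagged, prompting removal of the marker.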
Adding new tests
When adding a new feature or fixing a bug, follow this pattern:
- Identify the contract. What behavioral guarantee should your change preserve or introduce? Write this as a plain-English statement (e.g., “Subagent timeout should produce a SubagentExecutionRecord with success=False”).
- Write the test first. Create a test in the appropriate tests/ subdirectory that asserts the expected behavior. Use descriptive test names that read as contract statements.
- Make the test pass. Implement the feature or fix. The test should pass without any special-casing or mocking of the behavior under test.
- Verify stability. Run the full suite to confirm your change does not break existing contracts.
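The steps above can be sketched with stand-in types. SubagentExecutionRecord is named in the example contract, but its fields and the `run_subagent` helper are hypothetical:

```python
# A contract-style test for the example above; all names besides
# SubagentExecutionRecord are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubagentExecutionRecord:
    success: bool
    error: Optional[str] = None

def run_subagent(task: str, timeout_s: float) -> SubagentExecutionRecord:
    # Stand-in implementation: a zero timeout models the timeout path.
    if timeout_s <= 0:
        return SubagentExecutionRecord(success=False, error="timeout")
    return SubagentExecutionRecord(success=True)

# The test name reads as the contract statement it protects.
def test_subagent_timeout_produces_failed_record():
    record = run_subagent("summarize", timeout_s=0)
    assert record.success is False
    assert record.error == "timeout"

test_subagent_timeout_produces_failed_record()
```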