AFK classifies every error and applies a policy-driven response. Understanding the failure matrix helps you configure the right behavior for your use case.
Error classification
Every error is classified into one of three categories:
| Classification | Meaning | Default behavior |
| --- | --- | --- |
| Retryable | Transient failure, may succeed on retry | Retry with exponential backoff |
| Terminal | Permanent failure, will not recover | Stop and report error |
| Non-fatal | Something went wrong, but the run can continue | Log warning, continue |
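The default behaviors in the table can be pictured as a small dispatch loop. The following is a minimal sketch, not AFK's implementation: `Classification`, `ClassifiedError`, `run_with_defaults`, and the attempt and delay values are all hypothetical names and numbers chosen for illustration.

```python
import time
from enum import Enum


class Classification(Enum):
    RETRYABLE = "retryable"
    TERMINAL = "terminal"
    NON_FATAL = "non-fatal"


class ClassifiedError(Exception):
    """Hypothetical error that carries its classification (not an AFK type)."""

    def __init__(self, message, classification):
        super().__init__(message)
        self.classification = classification


def run_with_defaults(operation, max_attempts=4, base_delay_s=0.5, sleep=time.sleep):
    """Run `operation`, applying the default behavior for each classification.

    Returns (result, warnings); result is None if a non-fatal error occurred.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    warnings = []
    for attempt in range(max_attempts):
        try:
            return operation(), warnings
        except ClassifiedError as err:
            if err.classification is Classification.NON_FATAL:
                # Non-fatal: record a warning and continue without a result.
                warnings.append(str(err))
                return None, warnings
            if err.classification is Classification.TERMINAL:
                raise  # Terminal: stop and report the error.
            if attempt == max_attempts - 1:
                raise  # Retryable, but the attempts are exhausted.
            # Retryable: exponential backoff (base, 2x base, 4x base, ...).
            sleep(base_delay_s * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the sketch testable: a caller can substitute a list's `append` to record the delays instead of actually waiting.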
Failure decision flow
Failure matrix by source
LLM failures
| Error | Classification | Example |
| --- | --- | --- |
| Rate limit (429) | Retryable | "Rate limit exceeded, retry after 2s" |
| Server error (500/502/503) | Retryable | "Internal server error" |
| Timeout | Retryable | "Request timed out after 60s" |
| Auth error (401/403) | Terminal | "Invalid API key" |
| Invalid request (400) | Terminal | "Model does not exist" |
| Circuit breaker open | Terminal (with fallback) | "Circuit breaker open for provider" |
Configuration:
agent = Agent(
    ...,
    fail_safe=FailSafeConfig(
        llm_failure_policy="degrade",  # "fail" or "degrade"
        fallback_model_chain=["gpt-4.1-mini"],  # Try a cheaper model
    ),
)
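To illustrate what a fallback model chain does conceptually, here is a minimal sketch that tries a primary model and then each fallback in order. It is not AFK's code: `call_with_fallback`, the `call_model` callable, and the primary model name used in the usage note are hypothetical, and the sketch does not model how AFK sets the run state afterwards.

```python
def call_with_fallback(call_model, primary_model, fallback_model_chain):
    """Try the primary model, then each fallback model in order.

    `call_model` is a hypothetical callable that takes a model name and
    returns a response or raises. Returns (response, model_used).
    """
    last_error = None
    for model in [primary_model, *fallback_model_chain]:
        try:
            return call_model(model), model
        except Exception as err:  # A real system would only fall back on selected errors.
            last_error = err
    raise RuntimeError("all models in the chain failed") from last_error
```

For example, if `call_model("gpt-4.1")` times out (the primary name is only an assumption here) and `call_model("gpt-4.1-mini")` succeeds, the function returns the second response together with `"gpt-4.1-mini"`.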
Tool failures

| Error | Classification | Example |
| --- | --- | --- |
| Validation error | Non-fatal | "Invalid arguments" (returned to LLM for self-correction) |
| Handler exception | Configurable | "Tool raised an error" |
| Timeout | Configurable | "Tool exceeded 10s timeout" |
| Policy denial | Non-fatal | "Action denied by policy" (returned to LLM) |
Configuration:
agent = Agent(
    ...,
    fail_safe=FailSafeConfig(
        tool_failure_policy="continue",  # "fail" | "degrade" | "continue"
    ),
)
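Validation errors are non-fatal because the error text goes back to the LLM rather than aborting the run. A rough sketch of that idea follows; `execute_tool`, the result dictionary shape, and the required-argument check are assumptions for illustration, not AFK's tool API.

```python
def execute_tool(handler, required_args, arguments):
    """Validate arguments before calling the tool handler.

    On a validation error, return the message as the tool result so the
    model can correct its call, instead of raising.
    """
    missing = [name for name in required_args if name not in arguments]
    if missing:
        return {"ok": False, "error": "Invalid arguments: missing " + ", ".join(missing)}
    return {"ok": True, "result": handler(**arguments)}
```

Exceptions raised inside `handler` are deliberately not caught here; in the matrix above those are "Configurable" and would be governed by `tool_failure_policy`.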
Subagent failures
| Error | Classification | Example |
| --- | --- | --- |
| Subagent run failed | Configurable | "Subagent 'researcher' completed with state=failed" |
| Subagent timeout | Configurable | "Subagent exceeded wall time" |
| Join policy violation | Terminal | "Required subagent failed (all_required policy)" |
Configuration:
agent = Agent(
    ...,
    fail_safe=FailSafeConfig(
        subagent_failure_policy="degrade",  # "fail" | "degrade" | "continue"
    ),
)
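The join policy violation in the table is terminal regardless of the configured policy. As a hedged illustration of an `all_required`-style check, the sketch below raises when any required subagent did not finish successfully. The function name, the state mapping, and treating every state other than `"completed"` as a failure are assumptions, not AFK's actual join logic.

```python
def enforce_all_required(subagent_states, required):
    """Raise if any required subagent did not complete.

    `subagent_states` maps a subagent name to its final state string.
    Assumption: any state other than "completed" counts as a failure.
    """
    failed = [name for name in required if subagent_states.get(name) != "completed"]
    if failed:
        raise RuntimeError(
            "Required subagent failed (all_required policy): " + ", ".join(failed)
        )
```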
Infrastructure failures
| Error | Classification | Example |
| --- | --- | --- |
| Memory backend unavailable | Non-fatal | "Could not persist checkpoint" |
| Telemetry export failed | Non-fatal (silent) | "OTEL exporter timed out" |
| Queue push failed | Retryable | "Redis connection refused" |
Failure policies
fail
Any failure causes the run to fail immediately.
fail_safe = FailSafeConfig(tool_failure_policy="fail")
Use when: All operations are critical and partial results are worse than no results.

degrade
Failures are tolerated. The run completes with state="degraded" instead of "completed".
fail_safe = FailSafeConfig(tool_failure_policy="degrade")
Use when: Partial results are better than no results (e.g., a tool fails but the agent can still answer).

continue
Failures are logged but ignored. The run continues as if nothing happened.
fail_safe = FailSafeConfig(tool_failure_policy="continue")
Use when: The failing component is non-essential (e.g., analytics tool, optional enrichment).
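One way to compare the three policies is by the run state each would produce after a failure. The sketch below encodes that reading; `final_state` is a hypothetical helper, and mapping "continue" to `"completed"` and "fail" to `"failed"` is an interpretation of the descriptions above rather than documented behavior.

```python
def final_state(policy, failure_count):
    """Return the run state implied by a failure policy.

    Assumptions: "fail" ends in "failed"; "continue" ends in "completed"
    because failures are ignored; "degrade" ends in "degraded".
    """
    if policy not in ("fail", "degrade", "continue"):
        raise ValueError("unknown policy: " + repr(policy))
    if failure_count == 0:
        return "completed"
    if policy == "fail":
        return "failed"
    if policy == "degrade":
        return "degraded"
    return "completed"
```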
Budget-triggered limits
When a budget limit is hit, the run is stopped immediately:
| Limit | Triggered by | Run state |
| --- | --- | --- |
| max_steps | Step count exceeded | failed or degraded |
| max_tool_calls | Tool call count exceeded | failed or degraded |
| max_total_cost_usd | Estimated cost exceeded | failed |
| max_wall_time_s | Wall time exceeded | interrupted |
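A budget check of this kind can be sketched as a comparison of running usage against each configured limit. The limit names come from the table; the `Budget` and `Usage` classes, the usage field names, and `first_exceeded_limit` are hypothetical, and treating "exceeded" as strictly greater than the limit is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    # A limit of None means "not enforced" in this sketch.
    max_steps: Optional[int] = None
    max_tool_calls: Optional[int] = None
    max_total_cost_usd: Optional[float] = None
    max_wall_time_s: Optional[float] = None


@dataclass
class Usage:
    steps: int = 0
    tool_calls: int = 0
    total_cost_usd: float = 0.0
    wall_time_s: float = 0.0


# (limit field, usage field) pairs, checked in this order.
_CHECKS = [
    ("max_steps", "steps"),
    ("max_tool_calls", "tool_calls"),
    ("max_total_cost_usd", "total_cost_usd"),
    ("max_wall_time_s", "wall_time_s"),
]


def first_exceeded_limit(budget, usage):
    """Return the name of the first exceeded limit, or None if within budget."""
    for limit_name, usage_name in _CHECKS:
        limit = getattr(budget, limit_name)
        if limit is not None and getattr(usage, usage_name) > limit:
            return limit_name
    return None
```

The function returns only the limit name; which run state follows (for example failed versus degraded for `max_steps`) is left to the table above.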
Next steps
Security Model Security boundaries and hardening checklist.
Core Runner Run lifecycle and state management.