AFK classifies every error and applies a policy-driven response. Understanding the failure matrix helps you configure the right behavior for your use case.

Error classification

Every error is classified into one of three categories:
| Classification | Meaning | Default behavior |
| --- | --- | --- |
| Retryable | Transient failure, may succeed on retry | Retry with exponential backoff |
| Terminal | Permanent failure, will not recover | Stop and report error |
| Non-fatal | Something went wrong, but the run can continue | Log warning, continue |
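The three default behaviors can be sketched in a few lines. This is an illustrative sketch only, not AFK's implementation: `ErrorClass`, `backoff_delays`, and `next_action` are hypothetical names, and the base delay, cap, and "stop once retries are exhausted" rule are assumptions made for the example.

```python
from enum import Enum
from typing import List


class ErrorClass(Enum):
    # Hypothetical enum mirroring the three categories in the table above.
    RETRYABLE = "retryable"
    TERMINAL = "terminal"
    NON_FATAL = "non_fatal"


def backoff_delays(max_retries: int, base_s: float = 0.5, cap_s: float = 30.0) -> List[float]:
    """Exponential backoff schedule: base_s * 2**attempt, capped at cap_s."""
    return [min(cap_s, base_s * (2 ** attempt)) for attempt in range(max_retries)]


def next_action(error_class: ErrorClass, attempt: int, max_retries: int) -> str:
    """Map a classified error to the default behavior from the table."""
    if error_class is ErrorClass.RETRYABLE:
        # Assumption: once retries are exhausted, the error is treated as final.
        return "retry" if attempt < max_retries else "stop"
    if error_class is ErrorClass.TERMINAL:
        return "stop"
    return "log_and_continue"
```

With these assumed defaults, `backoff_delays(4)` yields waits of 0.5, 1, 2, and 4 seconds. Real retry loops usually add random jitter to each delay so that concurrent clients do not retry in lockstep.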

Failure decision flow

Failure matrix by source

LLM failures

| Error | Classification | Example |
| --- | --- | --- |
| Rate limit (429) | Retryable | "Rate limit exceeded, retry after 2s" |
| Server error (500/502/503) | Retryable | "Internal server error" |
| Timeout | Retryable | "Request timed out after 60s" |
| Auth error (401/403) | Terminal | "Invalid API key" |
| Invalid request (400) | Terminal | "Model does not exist" |
| Circuit breaker open | Terminal (with fallback) | "Circuit breaker open for provider" |
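As a rough illustration of the HTTP rows in this table, the mapping from status code to classification might look like the sketch below. `classify_llm_error` is a hypothetical helper, not an AFK function, and returning `None` for status codes the table does not list is an assumption.

```python
from typing import Optional

# Status codes taken directly from the LLM failure table.
RETRYABLE_STATUSES = {429, 500, 502, 503}
TERMINAL_STATUSES = {400, 401, 403}


def classify_llm_error(status: Optional[int], timed_out: bool = False) -> Optional[str]:
    """Return "retryable" or "terminal", or None for codes the table does not cover."""
    if timed_out:
        return "retryable"
    if status in RETRYABLE_STATUSES:
        return "retryable"
    if status in TERMINAL_STATUSES:
        return "terminal"
    return None  # Assumption: unlisted codes are left for the caller to decide.
```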
Configuration:
agent = Agent(
    ...,
    fail_safe=FailSafeConfig(
        llm_failure_policy="degrade",              # "fail" or "degrade"
        fallback_model_chain=["gpt-5.2-mini"],     # Try cheaper model
    ),
)
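To make the difference between "fail" and "degrade" concrete, here is a minimal sketch of how a fallback chain could be walked. It is an assumption about the general pattern, not AFK's internals: `ProviderError`, `call_with_fallback`, and the `call_model` callable are hypothetical names introduced for the example.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a model call fails permanently."""


def call_with_fallback(
    call_model: Callable[[str], str],
    primary: str,
    fallback_chain: Sequence[str],
    policy: str = "degrade",
) -> Tuple[str, str]:
    """Try the primary model, then each fallback in order when policy is "degrade"."""
    try:
        return primary, call_model(primary)
    except ProviderError:
        if policy == "fail":
            raise  # "fail": surface the error without trying fallbacks.
    for model in fallback_chain:
        try:
            return model, call_model(model)
        except ProviderError:
            continue  # Move on to the next model in the chain.
    raise ProviderError("primary model and all fallbacks failed")
```

The function returns the model that actually answered alongside the response, so a caller can record that the run completed on a fallback.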

Tool failures

| Error | Classification | Example |
| --- | --- | --- |
| Validation error | Non-fatal | "Invalid arguments" (returned to LLM for self-correction) |
| Handler exception | Configurable | "Tool raised an error" |
| Timeout | Configurable | "Tool exceeded 10s timeout" |
| Policy denial | Non-fatal | "Action denied by policy" (returned to LLM) |
Configuration:
agent = Agent(
    ...,
    fail_safe=FailSafeConfig(
        tool_failure_policy="continue",  # "fail" | "degrade" | "continue"
    ),
)
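The sketch below illustrates how the tool rows could play out: validation errors always go back to the LLM, while handler exceptions follow the configured policy. All names (`ToolValidationError`, `run_tool`) are hypothetical. In particular, treating "degrade" as "continue, but flag the result as degraded" is an assumption for illustration and may not match AFK's actual semantics.

```python
from typing import Any, Callable, Dict


class ToolValidationError(Exception):
    """Hypothetical error raised when tool arguments fail validation."""


def run_tool(handler: Callable[..., Any], args: Dict[str, Any], policy: str = "continue") -> Dict[str, Any]:
    """Run a tool handler and turn failures into results according to the policy."""
    try:
        return {"ok": True, "result": handler(**args)}
    except ToolValidationError as exc:
        # Non-fatal: the message is returned so the LLM can correct its arguments.
        return {"ok": False, "error": f"Invalid arguments: {exc}"}
    except Exception as exc:
        if policy == "fail":
            raise  # "fail": the handler exception ends the run.
        # Assumption: "degrade" and "continue" both keep going; "degrade" flags it.
        return {
            "ok": False,
            "error": f"Tool raised an error: {exc}",
            "degraded": policy == "degrade",
        }
```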

Subagent failures

| Error | Classification | Example |
| --- | --- | --- |
| Subagent run failed | Configurable | "Subagent 'researcher' completed with state=failed" |
| Subagent timeout | Configurable | "Subagent exceeded wall time" |
| Join policy violation | Terminal | "Required subagent failed (all_required policy)" |
Configuration:
agent = Agent(
    ...,
    fail_safe=FailSafeConfig(
        subagent_failure_policy="degrade",  # "fail" | "degrade" | "continue"
    ),
)
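The join policy violation row is terminal regardless of `subagent_failure_policy`. A minimal sketch of an `all_required`-style check follows, assuming subagent results are available as a name-to-state mapping. The helper names, the mapping shape, and the use of `RuntimeError` are hypothetical; only the `failed` state and the error message come from the table.

```python
from typing import Dict, Iterable, List


def failed_required(states: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required subagents whose final state is "failed", sorted by name."""
    return sorted(name for name in required if states.get(name) == "failed")


def enforce_all_required(states: Dict[str, str], required: Iterable[str]) -> None:
    """Raise if any required subagent failed, mirroring the terminal join violation."""
    failed = failed_required(states, required)
    if failed:
        raise RuntimeError(
            f"Required subagent failed (all_required policy): {', '.join(failed)}"
        )
```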

Infrastructure failures

| Error | Classification | Example |
| --- | --- | --- |
| Memory backend unavailable | Non-fatal | "Could not persist checkpoint" |
| Telemetry export failed | Non-fatal (silent) | "OTEL exporter timed out" |
| Queue push failed | Retryable | "Redis connection refused" |

Failure policies

With the "fail" policy, any failure causes the run to fail immediately.
fail_safe=FailSafeConfig(tool_failure_policy="fail")
Use when: All operations are critical and partial results are worse than no results.

Budget-triggered limits

When a budget limit is hit, the run is stopped immediately:
| Limit | Triggered by | Run state |
| --- | --- | --- |
| max_steps | Step count exceeded | failed or degraded |
| max_tool_calls | Tool call count exceeded | failed or degraded |
| max_total_cost_usd | Estimated cost exceeded | failed |
| max_wall_time_s | Wall time exceeded | interrupted |
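A budget check of this kind can be sketched as a comparison of running usage against the configured limits. The limit names and run states below come from the table. The `Budget` and `Usage` dataclasses, the `first_exceeded` helper, and the order in which limits are checked are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    max_steps: int
    max_tool_calls: int
    max_total_cost_usd: float
    max_wall_time_s: float


@dataclass
class Usage:
    steps: int = 0
    tool_calls: int = 0
    total_cost_usd: float = 0.0
    wall_time_s: float = 0.0


# Possible run states per limit, copied from the table above.
RUN_STATES = {
    "max_steps": ("failed", "degraded"),
    "max_tool_calls": ("failed", "degraded"),
    "max_total_cost_usd": ("failed",),
    "max_wall_time_s": ("interrupted",),
}


def first_exceeded(budget: Budget, usage: Usage) -> Optional[str]:
    """Return the name of the first limit that usage exceeds, or None if within budget."""
    checks = [
        ("max_steps", usage.steps > budget.max_steps),
        ("max_tool_calls", usage.tool_calls > budget.max_tool_calls),
        ("max_total_cost_usd", usage.total_cost_usd > budget.max_total_cost_usd),
        ("max_wall_time_s", usage.wall_time_s > budget.max_wall_time_s),
    ]
    for name, exceeded in checks:
        if exceeded:
            return name
    return None
```

A run loop would call `first_exceeded` after every step and stop as soon as it returns a limit name, then look up the resulting state in `RUN_STATES`.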
