Troubleshooting

This guide covers common issues encountered when building and deploying AFK agents, with solutions and debugging tips.

Agent behavior issues

Agent keeps calling the same tool repeatedly

Symptoms: Agent enters a loop, calling the same tool multiple times without making progress. Causes:

Tool output doesn’t provide the information the agent needs
Agent instructions don’t clarify when to stop
Missing a tool that would help the agent determine completion

Solutions:

from afk.agents import FailSafeConfig

# Add hard limits to prevent runaway loops
agent = Agent(
    name="safe-agent",
    model="gpt-5.5",
    instructions="Complete the task in at most 3 tool calls. If you can't solve it, say so.",
    fail_safe=FailSafeConfig(
        max_tool_calls=5,  # Stop after 5 calls
    ),
)

Debug: Enable verbose logging to see tool call inputs/outputs:

import logging
logging.basicConfig(level=logging.DEBUG)

runner = Runner(telemetry="console")

Agent ignores tools and doesn’t call them

Symptoms: Agent responds with text but doesn’t use available tools. Causes:

Instructions don’t mention the tools or when to use them
Tool descriptions are unclear
Model being used doesn’t support function calling well

Solutions:

agent = Agent(
    name="helpful",
    model="gpt-5.5",
    instructions="""
    You have access to the following tools:
    - search_docs: Use this to find information in the knowledge base
    - calculator: Use this for any math calculations
    
    Always use tools when the user asks questions that require specific information or calculations.
    """,
    tools=[search_docs, calculator],
)

Agent produces inconsistent outputs

Symptoms: Same input produces different outputs on different runs. Causes:

Temperature is set too high
Missing structured output configuration
Non-deterministic system prompt

Solutions:

# Use request-level sampling controls for direct LLM calls
from afk.llms import LLMBuilder, LLMRequest, Message

client = (
    LLMBuilder()
    .provider("openai")
    .model("gpt-5.5")
    .build()
)

response = await client.chat(
    LLMRequest(
        model="gpt-5.5",
        messages=[Message(role="user", content="Classify this ticket")],
        temperature=0.0,
    )
)

agent = Agent(
    name="deterministic",
    model=client,
    instructions="Always respond in JSON format as specified.",
)

Memory issues

Conversation doesn’t persist between runs

Symptoms: Agent doesn’t remember previous messages. Causes:

Not using thread_id to link conversations
Memory store not configured correctly
Using in-memory store (loses state on restart)

Solution:

# Always use thread_id for multi-turn conversations
thread_id = "user-123-session-1"  # Consistent per user/conversation

r1 = await runner.run(agent, user_message="Hi", thread_id=thread_id)
r2 = await runner.run(agent, user_message="What did I just say?", thread_id=thread_id)
# r2 will remember r1's context

Check memory backend:

# Verify memory is configured
print(runner._memory_store)  # Should not be None

# For production, use persistent storage
runner = Runner(
    memory_store=SQLiteMemoryStore(path="./memory.sqlite3")
)

Resume doesn’t work

Symptoms: Calling runner.resume() doesn’t continue from where the run stopped. Solutions:

# Check run_id and thread_id are correct
print(result.run_id)      # Use this for resume
print(result.thread_id)    # Use this for thread

# Resume correctly
resumed = await runner.resume(
    agent,
    run_id=result.run_id,
    thread_id=result.thread_id,
)

Debug checkpoints:

# Check checkpoint state directly from the configured memory store
rows = await runner._memory_store.list_state(result.thread_id, prefix=f"checkpoint:{result.run_id}:")
print(f"Found {len(rows)} checkpoint records")

LLM issues

Rate limit errors

Symptoms: RateLimitError or 429 responses from LLM provider. Solutions:

from afk.llms import LLMSettings, RateLimitPolicy, create_llm_client

client = create_llm_client(
    provider="openai",
    settings=LLMSettings(default_model="gpt-5.5"),
    rate_limit_policy=RateLimitPolicy(requests_per_second=0.5, burst=5),
)

# Or use exponential backoff for retries
from afk.llms import RetryPolicy

client = create_llm_client(
    provider="openai",
    settings=LLMSettings(default_model="gpt-5.5"),
    retry_policy=RetryPolicy(max_retries=5, backoff_base_s=2.0),
)

Timeout errors

Symptoms: Requests hang or timeout before completing. Solutions:

# Set appropriate timeouts
from afk.llms import TimeoutPolicy

client = create_llm_client(
    provider="openai",
    settings=LLMSettings(default_model="gpt-5.5"),
    timeout_policy=TimeoutPolicy(request_timeout_s=120.0),
)

# Or per-request timeout via middleware
from afk.llms.middleware.timeout import TimeoutMiddleware, TimeoutConfig

config = TimeoutConfig(
    default_timeout_s=60.0,
    chat_timeout_s=120.0,  # Longer for complex reasoning
)

Model not found errors

Symptoms: ModelNotFoundError or InvalidRequestError. Solutions:

# Verify model name is correct
client = (
    LLMBuilder()
    .provider("openai")
    .model("gpt-5.5")  # Check exact model name
    .build()
)

# Use fallback for resilience
agent = Agent(
    name="resilient",
    model="gpt-5.5",  # Primary model
    fail_safe=FailSafeConfig(
        fallback_model_chain=["gpt-5.5", "gpt-5.5"],
    ),
)

Streaming issues

Streaming doesn’t work

Symptoms: run_stream() doesn’t return events or returns them all at once. Solutions:

# Make sure you're iterating correctly
handle = await runner.run_stream(agent, user_message="Tell me a story")

async for event in handle:
    if event.type == "text_delta":
        print(event.text_delta, end="")
    elif event.type == "completed":
        print(f"\n\nDone: {event.result.state}")

# Don't mix sync and async
# WRONG:
result = runner.run_sync(agent, ...)  # Sync
handle = await runner.run_stream(...)  # Async on same runner

# RIGHT:
handle = await runner.run_stream(agent, ...)

Streaming disconnects early

Symptoms: Stream ends before completion. Solutions:

# Use timeout middleware for streaming
from afk.llms.middleware.timeout import TimeoutMiddleware, TimeoutConfig

config = TimeoutConfig(stream_timeout_s=180.0)  # 3 min for long streams

handle = await runner.run_stream(agent, user_message="Write a long essay...")
try:
    async for event in handle:
        # process events
        pass
except asyncio.TimeoutError:
    print("Stream timed out")

Cost issues

Unexpected high costs

Symptoms: API costs much higher than expected. Causes:

Agent in a loop making many LLM calls
No cost limits configured
Expensive model being used unnecessarily

Solutions:

# ALWAYS set cost limits
agent = Agent(
    name="safe",
    model="gpt-5.5",
    fail_safe=FailSafeConfig(
        max_total_cost_usd=0.50,  # Stop at $0.50
    ),
)

from afk.observability import project_run_metrics_from_result

metrics = project_run_metrics_from_result(result)
print(metrics.estimated_cost_usd)

Token limit errors

Symptoms: ContextLengthExceeded or similar errors. Solutions:

# Compact memory to reduce context
await runner.compact_thread(
    thread_id=thread_id,
    event_policy=RetentionPolicy(max_events_per_thread=100),
)

# Or use a model with larger context
client = (
    LLMBuilder()
    .provider("openai")
    .model("gpt-5.5")  # Larger context than gpt-5.5
    .build()
)

Tool issues

Tool validation errors

Symptoms: ToolValidationError when tools are called. Solutions:

# Ensure Pydantic model matches tool implementation
class SearchArgs(BaseModel):
    query: str
    limit: int = Field(default=10, ge=1, le=100)  # Add constraints

@tool(args_model=SearchArgs, name="search", description="Search for documents.")
def search(args: SearchArgs) -> dict:
    # Implementation
    return {"results": []}

Tool not found errors

Symptoms: Agent can’t find or call a tool. Solutions:

# Verify tool is attached to agent
print(agent.tools)  # Should include your tool

# Verify tool name matches
@tool(name="my_tool", description="Do the thing.")
def my_tool(args):
    return {"ok": True}

# Call with exact name
agent = Agent(
    name="demo",
    tools=[my_tool],  # Tool function, not name string
)

Debug mode

Enable debug mode for detailed logging:

from afk.core import Runner, RunnerConfig

runner = Runner(
    config=RunnerConfig(
        debug=True,
        sanitize_tool_output=True,
    ),
)

Getting help

If you can’t resolve an issue:

Check the GitHub Issues for known issues
Enable debug logging and capture the full traceback
Include these details when reporting:
- AFK version (pip show afk)
- Python version
- LLM provider and model
- Minimal reproduction code
- Full error traceback

Next steps

Core Concepts

Understand how AFK components work together.

Evals

Test agent behavior before shipping.

Building with AI

Common patterns and anti-patterns.

API Reference

Detailed API documentation.

Start Here

Core Building Blocks

LLM Runtime

Production

Integrations

Troubleshooting

Agent behavior issues

Agent keeps calling the same tool repeatedly

Agent ignores tools and doesn’t call them

Agent produces inconsistent outputs

Memory issues

Conversation doesn’t persist between runs

Resume doesn’t work

LLM issues

Rate limit errors

Timeout errors

Model not found errors

Streaming issues

Streaming doesn’t work

Streaming disconnects early

Cost issues

Unexpected high costs

Token limit errors

Tool issues

Tool validation errors

Tool not found errors

Debug mode

Getting help

Next steps

Core Concepts

Evals

Building with AI

API Reference

​Agent behavior issues

​Agent keeps calling the same tool repeatedly

​Agent ignores tools and doesn’t call them

​Agent produces inconsistent outputs

​Memory issues

​Conversation doesn’t persist between runs

​Resume doesn’t work

​LLM issues

​Rate limit errors

​Timeout errors

​Model not found errors

​Streaming issues

​Streaming doesn’t work

​Streaming disconnects early

​Cost issues

​Unexpected high costs

​Token limit errors

​Tool issues

​Tool validation errors

​Tool not found errors

​Debug mode

​Getting help

​Next steps

Core Concepts

Evals

Building with AI

API Reference

Agent behavior issues

Agent keeps calling the same tool repeatedly

Agent ignores tools and doesn’t call them

Agent produces inconsistent outputs

Memory issues

Conversation doesn’t persist between runs

Resume doesn’t work

LLM issues

Rate limit errors

Timeout errors

Model not found errors

Streaming issues

Streaming doesn’t work

Streaming disconnects early

Cost issues

Unexpected high costs

Token limit errors

Tool issues

Tool validation errors

Tool not found errors

Debug mode

Getting help

Next steps