Symptoms: Agent enters a loop, calling the same tool multiple times without making progress.
Causes:
Tool output doesn’t provide the information the agent needs
Agent instructions don’t clarify when to stop
Missing a tool that would help the agent determine completion
Solutions:
    from afk.agents import FailSafeConfig

    # Add hard limits to prevent runaway loops
    agent = Agent(
        name="safe-agent",
        model="gpt-4.1-mini",
        instructions="Complete the task in at most 3 tool calls. If you can't solve it, say so.",
        fail_safe=FailSafeConfig(
            max_tool_calls=5,  # Stop after 5 calls
        ),
    )
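To confirm that a run really is looping, it can help to count identical tool calls in the run's call history. The following is a minimal, framework-agnostic sketch using only the standard library; `find_repeated_calls` and the `(tool name, arguments)` history format are hypothetical and not part of AFK.

```python
import json
from collections import Counter


def find_repeated_calls(calls, threshold=3):
    """Return (tool_name, args_json) keys seen at least `threshold` times."""
    counts = Counter(
        # Serialize arguments with sorted keys so equal dicts compare equal.
        (name, json.dumps(args, sort_keys=True))
        for name, args in calls
    )
    return [key for key, n in counts.items() if n >= threshold]


# Hypothetical call history: (tool name, arguments) pairs.
history = [
    ("search_docs", {"query": "refund policy"}),
    ("search_docs", {"query": "refund policy"}),
    ("calculator", {"expr": "2+2"}),
    ("search_docs", {"query": "refund policy"}),
]

repeated = find_repeated_calls(history)
print(repeated)
```

A tool that shows up here with the same arguments is usually one whose output does not tell the agent what to do next.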
Debug: Enable verbose logging to see tool call inputs/outputs:
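One framework-agnostic way to see tool inputs and outputs is to wrap the plain Python function behind a tool with a logging decorator. This is a sketch using only the standard library; `log_calls`, `lookup_order`, and the `tool_debug` logger name are hypothetical, and whether the wrapper composes with AFK's own tool decorator is an assumption to verify.

```python
import functools
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("tool_debug")  # hypothetical logger name


def log_calls(func):
    """Log the arguments and return value of every call to `func`."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("return %s -> %r", func.__name__, result)
        return result

    return wrapper


@log_calls
def lookup_order(order_id):
    # Hypothetical tool body used only for illustration.
    return {"order_id": order_id, "status": "shipped"}


result = lookup_order("A-17")
```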
Symptoms: Agent responds with text but doesn’t use available tools.
Causes:
Instructions don’t mention the tools or when to use them
Tool descriptions are unclear
Model being used doesn’t support function calling well
Solutions:
    agent = Agent(
        name="helpful",
        model="gpt-4.1-mini",
        instructions="""
        You have access to the following tools:
        - search_docs: Use this to find information in the knowledge base
        - calculator: Use this for any math calculations

        Always use tools when the user asks questions that require
        specific information or calculations.
        """,
        tools=[search_docs, calculator],
    )
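Hand-written tool lists in the instructions tend to drift from the tools actually attached. As a hedged sketch, the tool section of the prompt could be generated from the functions' docstrings. `describe_tools` is a hypothetical helper, and the functions here are plain Python stand-ins rather than AFK tool objects.

```python
import inspect


def describe_tools(tools):
    """Build an instruction block listing each function's name and summary."""
    lines = ["You have access to the following tools:"]
    for fn in tools:
        doc = inspect.getdoc(fn) or "No description provided."
        summary = doc.splitlines()[0]  # first docstring line only
        lines.append(f"- {fn.__name__}: {summary}")
    return "\n".join(lines)


def search_docs(query):
    """Use this to find information in the knowledge base."""
    return []


def calculator(expression):
    """Use this for any math calculations."""
    return 0


prompt = describe_tools([search_docs, calculator])
print(prompt)
```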
    # Always use thread_id for multi-turn conversations
    thread_id = "user-123-session-1"  # Consistent per user/conversation

    r1 = await runner.run(agent, user_message="Hi", thread_id=thread_id)
    r2 = await runner.run(agent, user_message="What did I just say?", thread_id=thread_id)
    # r2 will remember r1's context
Check memory backend:
    # Verify memory is configured
    print(runner._memory_store)  # Should not be None

    # For production, use persistent storage
    runner = Runner(
        memory_store=SQLiteMemoryStore(path="./memory.sqlite3")
    )
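The reason a consistent thread_id matters can be shown with a toy store that keys history by thread. This is only an illustration of the idea, assuming history is looked up per thread_id; `ToyMemory` is hypothetical and does not mirror AFK's memory store interface.

```python
class ToyMemory:
    """In-memory stand-in that keeps one message list per thread_id."""

    def __init__(self):
        self._threads = {}

    def append(self, thread_id, message):
        self._threads.setdefault(thread_id, []).append(message)

    def history(self, thread_id):
        # Unknown thread IDs yield an empty history, not an error.
        return list(self._threads.get(thread_id, []))


memory = ToyMemory()
memory.append("user-123-session-1", "Hi")

same_thread = memory.history("user-123-session-1")
other_thread = memory.history("user-123-session-2")
print(same_thread, other_thread)
```

A second turn sent under a different thread ID starts from an empty history, which looks to the user like the agent forgot the conversation.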
Symptoms: Calling runner.resume() doesn’t continue from where the run stopped.
Solutions:
    # Check run_id and thread_id are correct
    print(result.run_id)     # Use this for resume
    print(result.thread_id)  # Use this for thread

    # Resume correctly
    resumed = await runner.resume(
        agent,
        run_id=result.run_id,
        thread_id=result.thread_id,
    )
Debug checkpoints:
    # Check checkpoint state directly from the configured memory store
    rows = await runner._memory_store.list_state(
        result.thread_id,
        prefix=f"checkpoint:{result.run_id}:",
    )
    print(f"Found {len(rows)} checkpoint records")
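The prefix lookup above returns nothing when the run_id is wrong, which is a common cause of a failed resume. A small sketch of that prefix matching follows, assuming keys begin with `checkpoint:{run_id}:` as in the snippet; the key suffixes and the `checkpoint_keys` helper are hypothetical.

```python
def checkpoint_keys(keys, run_id):
    """Return the keys that belong to the given run's checkpoints."""
    prefix = f"checkpoint:{run_id}:"
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical state keys; only the "checkpoint:{run_id}:" prefix is
# taken from the snippet above, the suffixes are made up.
stored = [
    "checkpoint:run-a:step-1",
    "checkpoint:run-a:step-2",
    "checkpoint:run-b:step-1",
    "profile:user-123",
]

found = checkpoint_keys(stored, "run-a")
missing = checkpoint_keys(stored, "run-c")
print(len(found), len(missing))
```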
Symptoms: ModelNotFoundError or InvalidRequestError.
Solutions:
    # Verify model name is correct
    client = (
        LLMBuilder()
        .provider("openai")
        .model("gpt-4.1-mini")  # Check exact model name
        .build()
    )

    # Use fallback for resilience
    agent = Agent(
        name="resilient",
        model="gpt-4.1",  # Primary model
        fail_safe=FailSafeConfig(
            fallback_model_chain=["gpt-4.1-mini", "gpt-4.1-nano"],
        ),
    )
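The idea behind a fallback chain can be sketched in plain Python: try each model in order and return the first successful response. This illustrates the general pattern and is not AFK's implementation; `ModelUnavailable`, `call_with_fallback`, and `fake_call` are hypothetical.

```python
class ModelUnavailable(Exception):
    """Hypothetical error raised when a model cannot serve the request."""


def call_with_fallback(models, call):
    """Try each model in order; return (model, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, call(model)
        except ModelUnavailable as exc:
            errors.append((model, str(exc)))
    raise RuntimeError(f"all models failed: {errors}")


def fake_call(model):
    # Stand-in for a real request: the primary model always fails here.
    if model == "gpt-4.1":
        raise ModelUnavailable("primary unavailable")
    return f"response from {model}"


used, reply = call_with_fallback(
    ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"], fake_call
)
print(used, reply)
```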
Symptoms: Stream ends before completion.
Solutions:
    import asyncio

    # Use timeout middleware for streaming
    from afk.llms.middleware.timeout import TimeoutMiddleware, TimeoutConfig

    config = TimeoutConfig(stream_timeout_s=180.0)  # 3 min for long streams

    handle = await runner.run_stream(agent, user_message="Write a long essay...")
    try:
        async for event in handle:
            # process events
            pass
    except asyncio.TimeoutError:
        print("Stream timed out")
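To tell a stalled stream from one that finished, a per-event timeout can be applied on the consumer side with `asyncio.wait_for`. This is a standard-library sketch with a fake stream standing in for the real handle; `fake_stream` and `consume` are hypothetical, and this is separate from whatever the timeout middleware does internally.

```python
import asyncio


async def fake_stream(delays):
    """Yield one event after each delay (a stand-in for a real stream)."""
    for i, delay in enumerate(delays):
        await asyncio.sleep(delay)
        yield f"event-{i}"


async def consume(stream, per_event_timeout_s):
    """Collect events, stopping if the next one takes too long to arrive."""
    received = []
    iterator = stream.__aiter__()
    while True:
        try:
            event = await asyncio.wait_for(
                iterator.__anext__(), per_event_timeout_s
            )
        except StopAsyncIteration:
            return received, "completed"
        except asyncio.TimeoutError:
            return received, "timed_out"
        received.append(event)


ok = asyncio.run(consume(fake_stream([0.0, 0.0]), 0.2))
stalled = asyncio.run(consume(fake_stream([0.0, 0.0, 1.0]), 0.2))
print(ok, stalled)
```

Returning the events received so far alongside a status makes it easier to see where the stream stopped.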
Symptoms: ContextLengthExceeded or similar errors.
Solutions:
    # Compact memory to reduce context
    await runner.compact_thread(
        thread_id=thread_id,
        event_policy=RetentionPolicy(max_events_per_thread=100),
    )

    # Or use a model with larger context
    client = (
        LLMBuilder()
        .provider("openai")
        .model("gpt-4.1")  # Check the provider's documented context window before switching
        .build()
    )
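What a cap on events per thread does to history can be shown with a toy function. The sketch assumes the policy keeps the most recent events, which the snippet above does not state; `keep_latest` is hypothetical and not AFK's compaction logic.

```python
def keep_latest(events, max_events):
    """Keep only the newest `max_events` entries (assumed retention rule)."""
    if max_events <= 0:
        return []
    return events[-max_events:]


# Hypothetical thread history of 250 events, oldest first.
events = [f"event-{i}" for i in range(250)]

compacted = keep_latest(events, 100)
print(len(compacted), compacted[0], compacted[-1])
```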
Symptoms: Agent can’t find or call a tool.
Solutions:
    # Verify tool is attached to agent
    print(agent.tools)  # Should include your tool

    # Verify tool name matches
    @tool(name="my_tool", description="Do the thing.")
    def my_tool(args):
        return {"ok": True}

    # Call with exact name
    agent = Agent(
        name="demo",
        tools=[my_tool],  # Tool function, not name string
    )
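Name mismatches are often a one-character difference, such as a hyphen where the registered name uses an underscore. A hedged sketch of a lookup that suggests close matches follows, using `difflib` from the standard library; `resolve_tool` and the list of registered names are hypothetical.

```python
import difflib


def resolve_tool(requested, registered):
    """Return (exact_name_or_None, close_match_suggestions)."""
    if requested in registered:
        return requested, []
    suggestions = difflib.get_close_matches(
        requested, registered, n=3, cutoff=0.6
    )
    return None, suggestions


registered_names = ["my_tool", "search_docs"]

exact = resolve_tool("my_tool", registered_names)
typo = resolve_tool("my-tool", registered_names)
print(exact, typo)
```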