Version: v0.3.4

Agent Execution Loop

Agents are the heart of Llama Stack applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows an execution loop that enables multi-step reasoning, tool usage, and safety checks.

Steps in the Agent Workflow

Each agent turn follows these key steps:

  1. Initial Safety Check: The user's input is first screened through configured safety shields

  2. Context Retrieval:

    • If RAG is enabled, the agent can choose to query relevant documents from memory banks. Use the instructions field to steer when and how the agent retrieves.
    • New documents are first inserted into the memory bank.
    • Retrieved context is provided to the LLM as a tool response in the message history.
  3. Inference Loop: The agent enters its main execution loop:

    • The LLM receives the user prompt along with any previous tool outputs
    • The LLM generates a response, potentially with tool calls
    • If tool calls are present:
      • Tool inputs are safety-checked
      • Tools are executed (e.g., web search, code execution)
      • Tool responses are fed back to the LLM for synthesis
    • The loop continues until:
      • The LLM provides a final response without tool calls
      • Maximum iterations are reached
      • Token limit is exceeded
  4. Final Safety Check: The agent's final response is screened through safety shields
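The turn described above can be sketched in a few lines of Python. This is illustrative pseudologic, not the actual Llama Stack implementation; `check_safety` and `run_tool` are hypothetical stand-ins for the real shields and tool runtime:

```python
# Sketch of one agent turn. The helpers below stand in for real
# inference, tool execution, and safety shields.

def check_safety(item):
    """Stand-in for a safety shield; a real shield could refuse or raise."""
    pass

def run_tool(call):
    """Stand-in for tool execution (web search, code interpreter, ...)."""
    return f"result of {call['name']}"

def run_turn(messages, call_llm, max_infer_iters=5):
    check_safety(messages[-1])            # 1. initial safety check on user input
    for _ in range(max_infer_iters):      # 3. inference loop
        reply = call_llm(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:                # final response without tool calls
            check_safety(reply)           # 4. final safety check
            return reply
        for call in tool_calls:
            check_safety(call)            # tool inputs are safety-checked
            messages.append({"role": "tool", "content": run_tool(call)})
    return messages[-1]                   # maximum iterations reached

# Fake LLM: requests one tool call, then answers once it sees the result.
def fake_llm(messages):
    if any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "content": "done"}
    return {"role": "assistant", "content": "", "tool_calls": [{"name": "search"}]}

final = run_turn([{"role": "user", "content": "hi"}], fake_llm)
print(final["content"])  # -> done
```

Note that token-limit termination is omitted from the sketch; in practice the loop also stops when the response budget is exhausted.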

Execution Flow Diagram

Each step in this process can be monitored and controlled through the configuration options described below.

Agent Execution Example

Here's an example that demonstrates monitoring the agent's execution:

from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Replace host and port
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    # Check with `llama-stack-client models list`
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Enable both RAG and tool usage
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {"vector_db_ids": ["my_docs"]},
        },
        "builtin::code_interpreter",
    ],
    # Configure safety (optional)
    input_shields=["llama_guard"],
    output_shields=["llama_guard"],
    # Control the inference loop
    max_infer_iters=5,
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
        "max_tokens": 2048,
    },
)
session_id = agent.create_session("monitored_session")

# Stream the agent's execution steps
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    documents=[
        {
            "content": "https://raw.githubusercontent.com/example/code.py",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)

# Monitor each step of execution
for log in AgentEventLogger().log(response):
    log.print()

Key Configuration Options

Loop Control

  • max_infer_iters: Maximum number of inference iterations (default: 5)
  • max_tokens: Token limit for responses
  • temperature: Controls response randomness
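For example, the sampling_params dict from the agent above can be tuned per use case. The "greedy" strategy name below is an assumption based on common strategy names; check which strategies your llama-stack-client version supports:

```python
# Two sampling_params configurations. "top_p" matches the example
# above; "greedy" is an assumed strategy name for deterministic output.
creative = {
    "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
    "max_tokens": 2048,  # token limit for the response
}
deterministic = {
    "strategy": {"type": "greedy"},  # always pick the most likely token
    "max_tokens": 512,
}
```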

Safety Configuration

  • input_shields: Safety checks for user input
  • output_shields: Safety checks for agent responses
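Conceptually, input and output shields bracket every turn. A minimal sketch of that control flow, where `shield` is a toy stand-in and not the Llama Guard API:

```python
# Illustrative only: shows where input/output shields run relative to
# a turn. Real shields are model-based classifiers, not word filters.

def shield(text, banned=("secret",)):
    """Toy shield: flag any message containing a banned word."""
    return not any(word in text for word in banned)

def guarded_turn(user_input, generate):
    if not shield(user_input):   # input_shields: screen the user message
        return "[input blocked]"
    reply = generate(user_input)
    if not shield(reply):        # output_shields: screen the agent reply
        return "[output blocked]"
    return reply

print(guarded_turn("tell me a secret", lambda s: s.upper()))  # -> [input blocked]
print(guarded_turn("hello", lambda s: s.upper()))             # -> HELLO
```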

Tool Integration

  • tools: List of available tools for the agent
  • tool_choice: Control over when tools are used
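As the example above shows, the tools list accepts two entry shapes: a plain string for a builtin tool that needs no arguments, or a dict when the tool requires configuration:

```python
# Tool entries copied from the example agent above.
tools = [
    {
        "name": "builtin::rag/knowledge_search",
        "args": {"vector_db_ids": ["my_docs"]},  # which vector DBs to search
    },
    "builtin::code_interpreter",  # no extra configuration needed
]
```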