Agent Execution Loop

Agents are the heart of Llama Stack applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.

Steps in the Agent Workflow

Each agent turn follows these key steps:

  1. Initial Safety Check: The user's input is first screened through the configured safety shields.

  2. Context Retrieval:

    • If RAG is enabled, the agent can choose to query relevant documents from memory banks. You can use the instructions field to steer what the agent retrieves.
    • New documents are first inserted into the memory bank (see the ingestion sketch after this list).
    • Retrieved context is provided to the LLM as a tool response in the message history.

  3. Inference Loop: The agent enters its main execution loop:

    • The LLM receives the user prompt, along with any previous tool outputs.
    • The LLM generates a response, potentially with tool calls.
    • If tool calls are present:
      • Tool inputs are safety-checked.
      • Tools are executed (e.g., web search, code execution).
      • Tool responses are fed back to the LLM for synthesis.
    • The loop continues until:
      • the LLM provides a final response without tool calls,
      • the maximum number of iterations is reached, or
      • the token limit is exceeded.

  4. Final Safety Check: The agent's final response is screened through the configured safety shields.
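
As a sketch of the document-ingestion side of step 2, the snippet below registers a vector database and inserts a document through the RAG tool. It assumes a vector_io provider is configured on your distribution; the database id, embedding model, and document contents are placeholders, and RAGDocument is the document type exposed by recent llama_stack_client versions.

from llama_stack_client import LlamaStackClient, RAGDocument

client = LlamaStackClient(base_url="http://localhost:8321")  # adjust to your server

# Register a vector database to serve as the memory bank
# (id and embedding model are placeholders for your distribution's setup)
client.vector_dbs.register(
    vector_db_id="my_docs",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)

# Insert a document; the RAG tool chunks and embeds it
client.tool_runtime.rag_tool.insert(
    documents=[
        RAGDocument(
            document_id="doc-1",
            content="https://example.com/guide.md",
            mime_type="text/plain",
            metadata={},
        )
    ],
    vector_db_id="my_docs",
    chunk_size_in_tokens=512,
)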

Execution Flow Diagram

sequenceDiagram
    participant U as User
    participant E as Executor
    participant M as Memory Bank
    participant L as LLM
    participant T as Tools
    participant S as Safety Shield

    Note over U,S: Agent Turn Start
    U->>S: 1. Submit Prompt
    activate S
    S->>E: Input Safety Check
    deactivate S

    loop Inference Loop
        E->>L: 2.1 Augment with Context
        L-->>E: 2.2 Response (with/without tool calls)

        alt Has Tool Calls
            E->>S: Check Tool Input
            S->>T: 3.1 Execute Tool
            T-->>E: 3.2 Tool Response
            E->>L: 4.1 Tool Response
            L-->>E: 4.2 Synthesized Response
        end

        opt Stop Conditions
            Note over E: Break if:
            Note over E: - No tool calls
            Note over E: - Max iterations reached
            Note over E: - Token limit exceeded
        end
    end

    E->>S: Output Safety Check
    S->>U: 5. Final Response

Each step in this process can be monitored and controlled through the configuration options described below.

Agent Execution Example

Here's an example that demonstrates monitoring the agent's execution:

from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Replace with your Llama Stack server's host and port
HOST = "localhost"
PORT = 8321

client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    # Check with `llama-stack-client models list`
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Enable both RAG and tool usage
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {"vector_db_ids": ["my_docs"]},
        },
        "builtin::code_interpreter",
    ],
    # Configure safety (optional)
    input_shields=["llama_guard"],
    output_shields=["llama_guard"],
    # Control the inference loop
    max_infer_iters=5,
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
        "max_tokens": 2048,
    },
)
session_id = agent.create_session("monitored_session")

# Stream the agent's execution steps
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    documents=[
        {
            "content": "https://raw.githubusercontent.com/example/code.py",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)

# Monitor each step of execution
for log in AgentEventLogger().log(response):
    log.print()
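
The streaming logger above prints events as they arrive. Alternatively, a turn can be created with stream=False and its recorded steps inspected after the fact; each step carries a step_type such as shield_call, inference, tool_execution, or memory_retrieval. A minimal sketch:

# Run a turn without streaming, then walk the recorded steps
turn = agent.create_turn(
    messages=[{"role": "user", "content": "Summarize the document"}],
    session_id=session_id,
    stream=False,
)

for step in turn.steps:
    # e.g. "shield_call", "inference", "tool_execution"
    print(step.step_type, step.step_id)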

Key Configuration Options

Loop Control

  • max_infer_iters: Maximum number of inference iterations (default: 5)
  • max_tokens: Token limit for responses
  • temperature: Controls response randomness
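
For example, a turn can be constrained to fewer iterations and shorter, deterministic responses. A minimal sketch (model name and limits are placeholders):

agent = Agent(
    client,
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Stop the loop after at most 3 LLM calls per turn
    max_infer_iters=3,
    sampling_params={
        "strategy": {"type": "greedy"},  # deterministic decoding
        "max_tokens": 512,  # hard cap on response length
    },
)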

Safety Configuration

  • input_shields: Safety checks for user input
  • output_shields: Safety checks for agent responses
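
Shields can also be invoked directly through the safety API, which is the same check the agent applies to its inputs and outputs. A minimal sketch, assuming a llama_guard shield is registered on your distribution:

# Run a registered shield against a message directly
result = client.safety.run_shield(
    shield_id="llama_guard",
    messages=[{"role": "user", "content": "Some user input to screen"}],
    params={},
)

if result.violation:
    print("Blocked:", result.violation.user_message)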

Tool Integration

  • tools: List of available tools for the agent
  • tool_choice: Control over when tools are used
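
tool_choice is part of the agent's tool configuration. A minimal sketch, assuming your client version accepts a tool_config mapping on the Agent constructor ("auto" lets the model decide, "required" forces a tool call, "none" disables tools):

agent = Agent(
    client,
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=["builtin::code_interpreter"],
    # Control when the model may call tools
    tool_config={"tool_choice": "auto"},
)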