
Responses API Internal Flow

The Responses API orchestrates inference, tool execution, safety checks, and state persistence in a single request. The flow changes depending on which parameters you pass.

Use the toggles below to select parameters and see how the request flows through internal subsystems. Pick a preset to see common patterns, or build your own combination.

[Sequence diagram: Client, FastAPI, Responses provider, Orchestrator, Inference, Store]

Client → FastAPI: POST /v1/responses
FastAPI → Responses: create_openai_response()
Responses → Orchestrator: create_response()
loop [until no tool_calls or max_infer_iters]:
    Orchestrator → Inference: openai_chat_completion()
    Inference → Orchestrator: completion + tool_calls
Orchestrator → Store: upsert_response_object()
Responses → Client: OpenAIResponseObject
Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="llama-3.3-70b",
    input="Explain how transformers work",
)

print(response.output_text)

Every request enters through FastAPI routes and is delegated to the Responses provider. The streaming orchestrator manages the inference loop — calling the LLM and executing requested server-side tools until the model produces a final response, emits a client-side function_call, or reaches max_infer_iters.
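The loop described above can be sketched as plain Python. This is a minimal illustration, not the actual orchestrator code: `call_inference` and `execute_tool` stand in for the real inference and tool-execution subsystems, and the message shapes are simplified.

```python
def run_inference_loop(call_inference, execute_tool, messages, max_infer_iters=10):
    """Illustrative sketch of the orchestrator loop: call the model,
    execute server-side tool calls, and repeat until done."""
    completion = None
    for _ in range(max_infer_iters):
        completion = call_inference(messages)
        tool_calls = completion.get("tool_calls", [])
        if not tool_calls:
            return completion  # model produced a final response
        if any(c["type"] == "function_call" for c in tool_calls):
            return completion  # client-side tool: surface it to the caller
        for call in tool_calls:  # server-side tools run here
            result = execute_tool(call)
            messages.append({"role": "tool", "content": result})
    return completion  # gave up after max_infer_iters
```

The three exit conditions mirror the ones named above: no more tool calls, a client-side function_call, or the iteration cap.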

Legend

Arrow style     Meaning
Solid teal      Request (outgoing call to a subsystem)
Solid gray      Response (return value from a subsystem)
Dashed amber    SSE event (streaming to client)
Dashed purple   Async operation (background queue, polling)

The dashed box marks the inference loop — the model calls tools, receives results, and calls inference again until no more server-side tool calls are needed, a client-side function_call is returned, or max_infer_iters is reached.
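On the client side, the SSE events in the legend arrive as a typed stream. The sketch below shows one way to accumulate output text from such a stream; the event type names (`response.output_text.delta`, `response.completed`) follow the OpenAI Responses streaming conventions and should be treated as assumptions here.

```python
def collect_text(events):
    """Accumulate output text from a stream of Responses SSE event dicts."""
    chunks = []
    for event in events:
        if event["type"] == "response.output_text.delta":
            chunks.append(event["delta"])  # incremental text fragment
        elif event["type"] == "response.completed":
            break  # final event: the response object is complete
    return "".join(chunks)
```

In practice the events come from `client.responses.create(..., stream=True)` rather than a list, but the dispatch-on-type pattern is the same.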