Record-Replay System

Understanding how Llama Stack captures and replays API interactions for testing.

Overview

The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?

The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.

How It Works

Request Hashing

Every API request gets converted to a deterministic hash for lookup:

import hashlib
import json
from urllib.parse import urlparse

def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not the full URL
        "body": body,  # Request parameters
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()

Key insight: The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.

# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}

Client Interception

The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.
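Conceptually, the patch swaps the client's create method for a wrapper and restores the original when the recording context exits. The sketch below is illustrative rather than the actual implementation; it assumes the OpenAI Python client's AsyncCompletions class as the patch target:

from openai.resources.chat.completions import AsyncCompletions

# Keep a reference to the real method so it can be restored on exit.
_original_create = AsyncCompletions.create

async def _intercepted_create(self, *args, **kwargs):
    # Record/replay dispatch would happen here (see the mode sections below);
    # in LIVE mode the call simply falls through to the real client.
    return await _original_create(self, *args, **kwargs)

def install_interception():
    AsyncCompletions.create = _intercepted_create   # on context entry

def remove_interception():
    AsyncCompletions.create = _original_create      # on context exit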

Storage Architecture

Recordings are stored as JSON files under the recording directory and looked up by request hash; a SQLite index keeps those lookups fast without loading response bodies (see "Why JSON + SQLite?" below).

recordings/
├── index.sqlite               # Fast indexed lookups by request hash
└── responses/
    ├── abc123def456.json      # Individual response files
    └── def789ghi012.json

JSON files store complete request/response pairs in human-readable format for debugging.
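A hedged sketch of how a hash resolves to a stored recording (the request/response field names follow the jq examples in the debugging section below; the load_recording helper is illustrative, not the library's API):

import json
from pathlib import Path

def load_recording(storage_dir: str, request_hash: str) -> dict:
    # Each recording lives at responses/<hash>.json under the storage dir.
    path = Path(storage_dir) / "responses" / f"{request_hash}.json"
    with open(path) as f:
        return json.load(f)

recording = load_recording("./recordings", "abc123def456")
print(recording["request"])             # normalized request that was hashed
print(recording["response"]["body"])    # serialized response payload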

Recording Modes

LIVE Mode

Direct API calls with no recording or replay:

from llama_stack.testing.api_recorder import api_recording, APIRecordingMode

with api_recording(mode=APIRecordingMode.LIVE):
    response = await client.chat.completions.create(...)

Use for initial development and debugging against real APIs.

RECORD Mode

Captures API interactions while passing through real responses:

with api_recording(mode=APIRecordingMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned

The recording process:

  1. Request intercepted and hashed
  2. Real API call executed
  3. Response captured and serialized
  4. Recording stored to disk
  5. Original response returned to caller
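Put together, the RECORD path looks roughly like this (a sketch; record_call and storage stand in for internal helpers, and _serialize_response is covered in the Serialization section below):

async def record_call(storage, client, **kwargs):
    # 1-2. Hash the request, then make the real API call.
    request_hash = normalize_request("POST", "/v1/chat/completions", {}, kwargs)
    response = await client.chat.completions.create(**kwargs)
    # 3-4. Serialize and persist the response.
    storage.store_recording(
        request_hash,
        {"endpoint": "/v1/chat/completions", "body": kwargs},
        {"body": _serialize_response(response)},
    )
    # 5. Hand the original response back to the caller unchanged.
    return response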

REPLAY Mode

Returns stored responses instead of making API calls:

with api_recording(mode=APIRecordingMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly

The replay process:

  1. Request intercepted and hashed
  2. Hash looked up in SQLite index
  3. Response loaded from JSON file
  4. Response deserialized and returned
  5. Error if no recording found
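The REPLAY path is the mirror image; a minimal sketch, with load_recording standing in for the index lookup plus JSON read described above:

def replay_call(storage, **kwargs):
    # 1-2. Hash the request and look it up in the recording store.
    request_hash = normalize_request("POST", "/v1/chat/completions", {}, kwargs)
    recording = storage.load_recording(request_hash)   # hypothetical helper
    if recording is None:
        # 5. Fail loudly so a missing recording never looks like a pass.
        raise RuntimeError(
            f"No recording found for request hash {request_hash}; "
            "re-run with LLAMA_STACK_TEST_INFERENCE_MODE=record to capture it"
        )
    # 3-4. Return the stored response body (deserialization is covered below).
    return recording["response"]["body"]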

Streaming Support

Streaming APIs present a unique challenge: how do you capture an async generator?

The Problem

# How do you record this?
async for chunk in client.chat.completions.create(stream=True):
    process(chunk)

The Solution

The system captures the complete stream up front, before yielding any chunks:

async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(
        request_hash, request_data, {"body": chunks, "is_streaming": True}
    )

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()

This ensures:

  • Complete capture - The entire stream is saved atomically
  • Interface preservation - The returned object behaves like the original API
  • Deterministic replay - Same chunks in the same order every time

Serialization

API responses contain complex Pydantic objects that need careful serialization:

def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json"),
        }
    return response

This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods.
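Deserialization is the mirror image; a minimal sketch, assuming the __type__/__data__ envelope above (the helper name is illustrative):

import importlib

def _deserialize_response(data):
    if isinstance(data, dict) and "__type__" in data:
        module_path, _, class_name = data["__type__"].rpartition(".")
        cls = getattr(importlib.import_module(module_path), class_name)
        # Rebuild the original Pydantic model, restoring validation and methods.
        return cls.model_validate(data["__data__"])
    return data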

Environment Integration

Environment Variables

Control recording behavior globally:

export LLAMA_STACK_TEST_INFERENCE_MODE=replay   # this is the default
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings # default is tests/integration/recordings
pytest tests/integration/

Pytest Integration

The system integrates automatically based on environment variables, requiring no changes to test code.
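As a rough illustration of that wiring, an autouse pytest fixture could read the environment variables and wrap each test in the recording context. The fixture below is a sketch, not the shipped integration, and it assumes APIRecordingMode accepts the lowercase string values shown above:

# conftest.py (illustrative sketch)
import os

import pytest

from llama_stack.testing.api_recorder import APIRecordingMode, api_recording

@pytest.fixture(autouse=True)
def _record_replay():
    # Assumes the enum accepts the same lowercase values used by the env var.
    mode = APIRecordingMode(os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay"))
    storage_dir = os.environ.get(
        "LLAMA_STACK_TEST_RECORDING_DIR", "tests/integration/recordings"
    )
    with api_recording(mode=mode, storage_dir=storage_dir):
        yield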

Debugging Recordings

Inspecting Storage

# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"

# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'

# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"

Common Issues

Hash mismatches: Request parameters changed slightly between record and replay

# Compare request details
cat recordings/responses/abc123.json | jq '.request'

Serialization errors: Response types changed between versions

# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py

Missing recordings: New test or changed parameters

# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py

Design Decisions

Why Not Mocks?

Traditional mocking breaks down with AI APIs because:

  • Response structures are complex and evolve frequently
  • Streaming behavior is hard to mock correctly
  • Edge cases in real APIs get missed
  • Mocks become brittle maintenance burdens

Why Precise Hashing?

Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.

Why JSON + SQLite?

  • JSON - Human readable, diff-friendly, easy to inspect and modify
  • SQLite - Fast indexed lookups without loading response bodies
  • Hybrid - Best of both worlds for different use cases

This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.

Provider-Specific: AWS Bedrock

Bedrock integration tests use the record/replay mechanism. Tests run against pre-recorded API responses in CI (replay mode), eliminating the need for AWS credentials.

Prerequisites

  1. AWS Account with access to Amazon Bedrock
  2. Model Access: Request access to openai.gpt-oss-20b-1:0 in us-west-2 region via AWS Console
  3. AWS CLI configured with valid credentials

Generate Bearer Token

Bedrock provides short-term API keys via the AWS Console (they expire after 12 hours):

  1. Go to Amazon Bedrock Console (ensure you're in us-west-2 region)
  2. In the left sidebar under Discover, click API keys
  3. Click Generate short-term API keys
  4. Copy the generated API key
export AWS_BEARER_TOKEN_BEDROCK=bedrock-api-key-YmVkcm9jay5hbWF6b25hd3MuY29...
export AWS_DEFAULT_REGION=us-west-2

Stack Configuration

Use ci-tests::config.yaml for Bedrock tests. This config pre-registers the Bedrock model and provider, enabling replay mode without additional setup.

Recording Bedrock Tests

Record the three test functions (which generate 6 parametrized tests):

uv run pytest -v -s \
  tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming \
  tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming \
  tests/integration/inference/test_openai_completion.py::test_inference_store \
  --setup=bedrock \
  --stack-config=ci-tests::config.yaml \
  --inference-mode=record \
  -k "client_with_models"

Verifying Replay

# Test without credentials (replay mode)
unset AWS_BEARER_TOKEN_BEDROCK

uv run pytest -v -s \
  tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming \
  --setup=bedrock \
  --stack-config=ci-tests::config.yaml \
  --inference-mode=replay \
  -k "client_with_models"

Bedrock API Limitations

Feature                  Supported   Notes
/v1/chat/completions     Yes         Both streaming and non-streaming
/v1/completions          No          Not supported by Bedrock OpenAI API
/v1/embeddings           No          Use a different embedding provider
Tool calling             No          Bedrock's endpoint doesn't support tools parameter