Chapter 15 — Observability: Tracing Agent Reasoning

In traditional software, observability means metrics, logs, and traces: you instrument the code, collect the data, and visualize what is happening inside a running system. The discipline is mature and well understood. It still applies to agentic systems, but the thing you are observing is fundamentally different: you are observing reasoning, meaning the agent's decision-making process, its tool call choices, its confidence, and its interpretation of the system prompt. Observing that requires new instrumentation at the semantic level.

15.1 Why Agents Are Hard to Observe

A traditional microservice either returns a 200 or a 500. You can trace the call stack. You can replay the exact request. Agent behavior has none of that determinism. The same input may produce different outputs on different runs. The "bug" may not be an exception at all. It may be subtle semantic drift: the agent technically answered the question, but the answer is slightly wrong, slightly off-tone, or slightly too verbose. That kind of degradation is invisible without semantic observability tooling.

Multi-agent systems compound the difficulty. If a five-agent pipeline produces a wrong final output, which agent is responsible? Without distributed tracing that links all agents by a shared trace_id, you cannot answer that question.

15.2 Four Observability Layers

Fig 15.1 — Four-Layer Agent Observability Stack

The four-layer stack separates what the business cares about (Layer 4) from what the infrastructure measures (Layer 1). Alerts trigger at the business layer; root-cause analysis drills downward through each layer.

15.3 Implementing Reasoning Traces

The reasoning trace is the most valuable layer of agentic observability, and the one with no counterpart in traditional systems. It captures not just what the agent said, but why it made specific choices. This requires explicit prompt instrumentation: instruct the agent to output its reasoning in a structured format that can be extracted and stored separately from the user-facing response.

Chain-of-Thought Logging

System Prompt with Structured Trace Output

import re

SYSTEM_PROMPT_WITH_TRACE = """
You are a customer service agent. Before responding, reason through 
the problem. Structure your complete output as:

<reasoning>
1. What is the customer asking? (classify intent)
2. What information do I need from tools?
3. What are the constraints from the system prompt?
4. What is the appropriate response format?
</reasoning>
<output>
[Your response to the customer here]
</output>
"""

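def extract_reasoning_trace(full_response: str) -> tuple[str, str]:
    """Separates reasoning from user-facing output."""
    # re.DOTALL lets '.' span newlines, since both blocks are multi-line.
    reasoning_match = re.search(r'<reasoning>(.*?)</reasoning>',
                                full_response, re.DOTALL)
    output_match    = re.search(r'<output>(.*?)</output>',
                                full_response, re.DOTALL)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    if output_match:
        output = output_match.group(1).strip()
    else:
        # No <output> tags: fall back to the raw response minus any reasoning
        # block, so internal reasoning is never shown to the customer.
        output = re.sub(r'<reasoning>.*?</reasoning>', '', full_response,
                        flags=re.DOTALL).strip()
    return reasoning, output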
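
Once the reasoning block is extracted, it can be split into its numbered steps so that each step is stored and queried as its own field. The following is a minimal sketch, assuming the model follows the numbered format requested in the system prompt; `parse_reasoning_steps` and `build_trace_record` are hypothetical names introduced for illustration, not part of any library.

```python
import re


def parse_reasoning_steps(reasoning: str) -> list[str]:
    """Split a numbered reasoning block ("1. ...", "2. ...") into step texts.

    Assumption: the model numbered its steps as the prompt asked. If no
    numbered steps are found, the whole block is returned as a single step.
    """
    # Capture the text after each line-leading "N." marker, up to the next
    # marker or the end of the string.
    steps = re.findall(r'^\s*\d+\.\s*(.*?)(?=^\s*\d+\.|\Z)', reasoning,
                       re.DOTALL | re.MULTILINE)
    # Collapse internal whitespace so each step is a single clean line.
    steps = [" ".join(s.split()) for s in steps if s.strip()]
    if not steps and reasoning.strip():
        return [" ".join(reasoning.split())]
    return steps


def build_trace_record(trace_id: str, reasoning: str, output: str) -> dict:
    """Hypothetical record shape for storing a reasoning trace."""
    return {
        "trace_id": trace_id,
        "reasoning_steps": parse_reasoning_steps(reasoning),
        "output": output,
    }
```

Storing the steps as a list rather than one text blob makes it possible to query, for example, all runs whose first step classified the intent a particular way.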

Tool Call Tracing

Every tool invocation should be logged with the full input, the full output, the latency, and the trace ID. A TracedToolWrapper decorates each tool with automatic logging, without changing the tool's implementation.

TracedToolWrapper

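import time
from datetime import datetime, timezone
from typing import Any

class TracedToolWrapper:
    """Wraps an async tool so every call is logged to the trace store."""

    def __init__(self, tool_fn, trace_store: "TraceStore"):
        # TraceStore is the application's trace sink, defined elsewhere;
        # it must provide log_tool_call(dict).
        self.tool_fn     = tool_fn
        self.trace_store = trace_store

    async def __call__(self, trace_id: str, **kwargs) -> Any:
        start_time = time.perf_counter()  # monotonic; wall-clock time can jump
        result, error = None, None
        try:
            result = await self.tool_fn(**kwargs)
            return result
        except Exception as exc:
            error = repr(exc)  # record the failure, then re-raise unchanged
            raise
        finally:
            # Runs on success and on failure, so failed calls are traced too.
            latency_ms = (time.perf_counter() - start_time) * 1000
            self.trace_store.log_tool_call({
                "trace_id":   trace_id,
                "tool":       self.tool_fn.__name__,
                "input":      kwargs,
                "output":     result,
                "error":      error,
                "latency_ms": latency_ms,
                "timestamp":  datetime.now(timezone.utc).isoformat(),
            })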
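
The wrapper only requires that its trace store expose a `log_tool_call` method. For local development and tests, an in-memory store is often enough. The sketch below is one possible shape; `InMemoryTraceStore`, `tool_calls`, and `slowest_call` are hypothetical names, and a production store would presumably write to a database or log pipeline instead.

```python
from collections import defaultdict
from typing import Any, Optional


class InMemoryTraceStore:
    """Hypothetical in-memory trace store, keyed by trace_id."""

    def __init__(self) -> None:
        self._events: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def log_tool_call(self, event: dict[str, Any]) -> None:
        # Group events by trace so one workflow can be replayed in order.
        self._events[event["trace_id"]].append(event)

    def tool_calls(self, trace_id: str) -> list[dict[str, Any]]:
        """Return a copy of all tool-call events logged for one trace."""
        return list(self._events.get(trace_id, []))

    def slowest_call(self, trace_id: str) -> Optional[dict[str, Any]]:
        """Return the highest-latency event for a trace, or None if empty."""
        calls = self.tool_calls(trace_id)
        if not calls:
            return None
        return max(calls, key=lambda e: e.get("latency_ms", 0.0))
```

With this store, wrapping a tool looks like `lookup = TracedToolWrapper(lookup_order, store)` followed by `await lookup(trace_id="t-1", order_id="A17")`, where `lookup_order` stands for any async tool function; `store.tool_calls("t-1")` then returns the logged event.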

15.4 Semantic Drift Detection

Semantic drift is the gradual degradation of agent behavior over time, often invisible in individual interactions but detectable in aggregate. Common causes are provider updates to the underlying model version, incremental system prompt edits that collectively shift behavior, and distribution shift in user queries (real-world traffic diverges from the golden dataset). The fix is a continuous evaluation loop that runs the golden dataset against the production agent on a schedule and alerts when pass rates decline.

Online Evaluation with Drift Alert

import random
from datetime import datetime, timezone

# EvalReport, GOLDEN_DATASET, run_eval_suite, compute_rolling_pass_rate,
# BASELINE_PASS_RATE, and alert are assumed to be defined elsewhere.

async def run_online_eval_sample(
    agent, sample_size: int = 50
) -> EvalReport:
    """Randomly samples from golden dataset and runs online evaluation."""
    # Cap the sample so random.sample cannot raise on a small dataset.
    sample   = random.sample(GOLDEN_DATASET,
                             min(sample_size, len(GOLDEN_DATASET)))
    results  = await run_eval_suite(agent, sample)

    report = EvalReport(
        timestamp    = datetime.now(timezone.utc),
        pass_rate    = results.pass_count / len(sample),
        failed_cases = results.failed_cases,
    )

    # Alert if the 7-day rolling pass rate falls more than 5 percentage
    # points below baseline.
    rolling_rate = compute_rolling_pass_rate(window_days=7)
    if rolling_rate < BASELINE_PASS_RATE - 0.05:
        alert.send(
            severity = "HIGH",
            message  = f"Agent pass rate drifted: {rolling_rate:.1%} "
                       f"(baseline: {BASELINE_PASS_RATE:.1%})"
        )
    return report
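
The listing above calls a rolling pass-rate helper without showing it. One possible shape is sketched here, assuming past evaluation runs are persisted as `(timestamp, pass_count, case_count)` tuples. The names `rolling_pass_rate` and `has_drifted`, the tuple layout, and the explicit `history` argument are assumptions made for illustration, not the chapter's actual helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rolling_pass_rate(
    history: list[tuple[datetime, int, int]],
    window_days: int = 7,
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Pooled pass rate over all eval runs inside the trailing window.

    Returns None when no runs fall inside the window.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    passed = total = 0
    for timestamp, pass_count, case_count in history:
        if cutoff <= timestamp <= now:
            passed += pass_count
            total += case_count
    return passed / total if total else None


def has_drifted(rate: Optional[float], baseline: float,
                tolerance: float = 0.05) -> bool:
    """True when the rate is more than `tolerance` below baseline."""
    return rate is not None and rate < baseline - tolerance
```

Pooling counts across runs, rather than averaging per-run rates, keeps a small sample from carrying the same weight as a large one. Treating an empty window as "no signal" instead of zero avoids a false alert when the scheduler has simply not run.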

15.5 Agent Audit Trail

For production agentic systems, especially those taking real-world actions, a complete audit trail is not optional: it is a compliance requirement. Every inference that leads to a real-world action must be traceable: who asked, what the agent was told, what it reasoned, what it called, and what it returned.

Audit Field	Description	Security Value
`run_id`	Unique identifier for this agent invocation	Links all events in one run
`trace_id`	Shared ID across all agents in a multi-agent workflow	Traces causality across agents
`timestamp`	UTC timestamp of the invocation	Enables time-based correlation
`user_id`	Authenticated user who triggered the invocation	Attribution for all actions
`intent`	Classified intent (from semantic router)	Audit trail for routing decisions
`prompt_version`	System prompt version identifier	Correlates behavior with specific prompts
`model`	Model name + version used	Correlates behavior with model versions
`tool_calls`	Full list of tool calls with inputs and outputs	Post-hoc detection of injection attacks
`reasoning_trace`	Extracted chain-of-thought reasoning	Root cause analysis for wrong output

15.6 Distributed Tracing in Multi-Agent Workflows

Fig 15.2 — Distributed Trace: Multi-Agent Workflow

A distributed trace gives each agent call, tool invocation, and retrieval step its own span with timing, token count, and outcome. The full trace reconstructs exactly what happened, in what order, and at what cost.
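
To make the span model concrete, here is a small sketch of a span record and a function that reconstructs order, cost, and the first failing step for one trace. The `Span` fields and the `summarize_trace` function are assumptions chosen for illustration; a real deployment would more likely rely on an established tracing framework than on hand-rolled classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one workflow
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str                 # agent, tool, or retrieval step
    start_ms: float
    end_ms: float
    tokens: int = 0
    ok: bool = True


def summarize_trace(spans: list[Span], trace_id: str) -> dict:
    """Reconstruct order, duration, token cost, and first failure of a trace."""
    own = sorted((s for s in spans if s.trace_id == trace_id),
                 key=lambda s: s.start_ms)
    if not own:
        return {"order": [], "duration_ms": 0.0,
                "total_tokens": 0, "first_failure": None}
    return {
        "order": [s.name for s in own],
        "duration_ms": max(s.end_ms for s in own) - own[0].start_ms,
        "total_tokens": sum(s.tokens for s in own),
        # Earliest-starting failed span: a starting point for blame analysis.
        "first_failure": next((s.name for s in own if not s.ok), None),
    }
```

In a five-agent pipeline that produced a wrong answer, `first_failure` points to the earliest step that reported a problem, and `parent_id` links let you walk from that step back to the agent that invoked it.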

Core Principle — Chapter 15

You cannot debug what you cannot see. Observability is not a luxury — it is the prerequisite for production-grade agentic systems. Instrument reasoning traces from day one, before you need to debug a failure in production.