Chapter 15

Observability: Tracing Agent Reasoning

You Cannot Debug What You Cannot See

Part VI: The New Security and Observability

In traditional software, observability is about metrics, logs, and traces. You instrument the code, collect the data, and visualize what is happening inside a running system. This discipline is mature and well understood. For agentic systems, the same discipline applies, but the thing you are observing is fundamentally different: you are observing reasoning. That means the agent's decision-making process, its tool call choices, its confidence, and its interpretation of the system prompt. Observing these requires new instrumentation at the semantic level.


15.1   Why Agents Are Hard to Observe

A traditional microservice either returns a 200 or a 500. You can trace the call stack. You can replay the exact request. Agent behavior has none of that determinism. The same input may produce different outputs on different runs. The "bug" may not be an exception — it may be subtle semantic drift: the agent technically answered the question, but the answer is slightly wrong, slightly off-tone, slightly too verbose. That kind of degradation is invisible without semantic observability tooling.

Multi-agent systems compound the difficulty. If a five-agent pipeline produces a wrong final output, which agent is responsible? Without distributed tracing that links all agents by a shared trace_id, you cannot answer that question.
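As a minimal sketch of why the shared trace_id matters, assume each agent emits an event record carrying a trace_id, a step index, the agent name, and a pass/fail verdict from some evaluation. These field names and the sample data are illustrative assumptions, not part of the text. Grouping events by trace_id makes it possible to find the earliest agent whose output failed its check:

```python
from collections import defaultdict
from typing import Optional

# Hypothetical event records; field names are illustrative assumptions.
EVENTS = [
    {"trace_id": "t1", "step": 1, "agent": "planner",    "passed": True},
    {"trace_id": "t1", "step": 2, "agent": "retriever",  "passed": False},
    {"trace_id": "t1", "step": 3, "agent": "summarizer", "passed": False},
    {"trace_id": "t2", "step": 1, "agent": "planner",    "passed": True},
]

def group_by_trace(events: list[dict]) -> dict[str, list[dict]]:
    """Collect events per trace_id, ordered by step within each trace."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        grouped[event["trace_id"]].append(event)
    for trace_events in grouped.values():
        trace_events.sort(key=lambda e: e["step"])
    return dict(grouped)

def first_failing_agent(events: list[dict], trace_id: str) -> Optional[str]:
    """Return the earliest agent in the trace whose output failed, if any."""
    for event in group_by_trace(events).get(trace_id, []):
        if not event["passed"]:
            return event["agent"]
    return None
```

The earliest failure is only a starting point for diagnosis: later agents may fail simply because they received bad input, which is exactly why the events need to be linked in order.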


15.2   Four Observability Layers

Fig 15.1 — Four-Layer Agent Observability Stack
Layer 1 (INFRA): Infrastructure Metrics
    Latency per call | Token usage / cost | API error rates | Model availability
Layer 2 (TOOLS): Tool & Retrieval Observability
    Tool call success / failure rate | Retrieved chunk relevance score | Tool selection distribution | Retrieval recall @k | Tool latency histogram
Layer 3 (BEHAVIOUR): Agent Behaviour Metrics
    Loop iteration count per task | Goal completion rate | Hallucination rate (eval-scored) | Context window utilisation | Stop condition trigger distribution
Layer 4 (BUSINESS): Business Outcome Metrics
    Task success rate (eval-graded) | User satisfaction score | Time to acceptable output | False positive / negative rate on high-stakes decisions | Business KPI impact
Alert at Layer 4 → diagnose in Layer 3 → isolate in Layer 2 → trace to Layer 1

The four-layer stack separates what the business cares about (Layer 4) from what the infrastructure measures (Layer 1). Alerts trigger at the business layer; root-cause analysis drills downward through each layer.
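The top-down drill can be sketched in a few lines. The following is a simplified illustration, assuming each metric is reduced to a single value compared against a threshold; the MetricCheck type, the metric names, and the numbers are hypothetical. When a Layer 4 metric is breached, the function lists breached metrics in the lower layers, ordered from Layer 3 down to Layer 1:

```python
from dataclasses import dataclass

@dataclass
class MetricCheck:
    layer: int                     # 1 = infrastructure ... 4 = business outcome
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True

    def breached(self) -> bool:
        """True when the metric is on the wrong side of its threshold."""
        if self.higher_is_better:
            return self.value < self.threshold
        return self.value > self.threshold

def drill_down(checks: list[MetricCheck]) -> list[MetricCheck]:
    """If any Layer 4 metric is breached, return breached lower-layer
    metrics ordered Layer 3 -> Layer 1; otherwise return an empty list."""
    if not any(c.layer == 4 and c.breached() for c in checks):
        return []
    lower = [c for c in checks if c.layer < 4 and c.breached()]
    return sorted(lower, key=lambda c: c.layer, reverse=True)

# Hypothetical snapshot of one metric per layer.
CHECKS = [
    MetricCheck(4, "task_success_rate", 0.80, 0.90),
    MetricCheck(3, "loop_iterations_per_task", 9.0, 6.0, higher_is_better=False),
    MetricCheck(2, "tool_call_failure_rate", 0.02, 0.05, higher_is_better=False),
    MetricCheck(1, "p95_latency_ms", 4200.0, 3000.0, higher_is_better=False),
]
```

With this snapshot, the Layer 4 breach points first at the loop count in Layer 3 and then at latency in Layer 1, while the tool layer looks healthy.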


15.3   Implementing Reasoning Traces

The reasoning trace is the most valuable and most distinctive part of agentic observability. It captures not just what the agent said, but why it made specific choices. This requires explicit prompt instrumentation: instruct the agent to output its reasoning in a structured format that can be extracted and stored separately from the user-facing response.

Chain-of-Thought Logging

System Prompt with Structured Trace Output
import re

SYSTEM_PROMPT_WITH_TRACE = """
You are a customer service agent. Before responding, reason through
the problem. Structure your complete output as:

<reasoning>
1. What is the customer asking? (classify intent)
2. What information do I need from tools?
3. What are the constraints from the system prompt?
4. What is the appropriate response format?
</reasoning>
<output>
[Your response to the customer here]
</output>
"""

def extract_reasoning_trace(full_response: str) -> tuple[str, str]:
    """Separates reasoning from user-facing output."""
    reasoning_match = re.search(r'<reasoning>(.*?)</reasoning>',
                                full_response, re.DOTALL)
    output_match    = re.search(r'<output>(.*?)</output>',
                                full_response, re.DOTALL)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    if output_match:
        output = output_match.group(1).strip()
    else:
        # No <output> tags: fall back to the full response, minus any
        # reasoning block, so the trace is never shown to the user.
        output = re.sub(r'<reasoning>.*?</reasoning>', '', full_response,
                        flags=re.DOTALL).strip()
    return reasoning, output

Tool Call Tracing

Every tool invocation should be logged with the full input, the full output, the latency, and the trace ID. A TracedToolWrapper decorates each tool with automatic logging, without changing the tool's implementation.

TracedToolWrapper
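import time
from datetime import datetime, timezone
from typing import Any

class TracedToolWrapper:
    """Wraps an async tool function and logs every call to the trace store."""

    # TraceStore is assumed to be defined elsewhere; it only needs a
    # log_tool_call(record: dict) method.
    def __init__(self, tool_fn, trace_store: "TraceStore"):
        self.tool_fn     = tool_fn
        self.trace_store = trace_store

    async def __call__(self, trace_id: str, **kwargs) -> Any:
        start_time = time.perf_counter()   # monotonic clock for latency
        result, error = None, None
        try:
            result = await self.tool_fn(**kwargs)
            return result
        except Exception as exc:
            error = repr(exc)              # record the failure, then re-raise
            raise
        finally:
            # Runs on success and on failure, so failed calls are traced too
            latency_ms = (time.perf_counter() - start_time) * 1000
            self.trace_store.log_tool_call({
                "trace_id":   trace_id,
                "tool":       self.tool_fn.__name__,
                "input":      kwargs,
                "output":     result,
                "error":      error,
                "latency_ms": latency_ms,
                "timestamp":  datetime.now(timezone.utc).isoformat(),
            })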
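
The TraceStore type used by TracedToolWrapper is not defined in this section. As a rough stand-in for local testing, an in-memory store might look like the sketch below. The class name InMemoryTraceStore and the tool_calls_for method are assumptions for illustration; a production store would more likely write to a database or a log pipeline.

```python
from collections import defaultdict
from typing import Any

class InMemoryTraceStore:
    """Hypothetical stand-in for TraceStore; keeps records in process memory."""

    def __init__(self) -> None:
        self._tool_calls: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def log_tool_call(self, record: dict[str, Any]) -> None:
        """Store a copy of one tool-call record under its trace_id."""
        self._tool_calls[record["trace_id"]].append(dict(record))

    def tool_calls_for(self, trace_id: str) -> list[dict[str, Any]]:
        """Return all tool calls logged for one trace, in insertion order."""
        return list(self._tool_calls.get(trace_id, []))

# Example: two calls in one trace, one in another.
store = InMemoryTraceStore()
store.log_tool_call({"trace_id": "abc123", "tool": "sql_query", "latency_ms": 88.0})
store.log_tool_call({"trace_id": "abc123", "tool": "rag_retrieve", "latency_ms": 312.0})
store.log_tool_call({"trace_id": "zzz999", "tool": "sql_query", "latency_ms": 91.0})
```

Keying records by trace_id keeps the lookup for "everything this run did" cheap, which is the query most debugging sessions start with.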

15.4   Semantic Drift Detection

Semantic drift is the gradual degradation of agent behavior over time, often invisible in individual interactions but detectable in aggregate. Common causes include provider updates to the underlying model version, incremental system prompt edits that collectively shift behavior, and distribution shift in user queries (real-world traffic diverges from the golden dataset). The fix is a continuous evaluation loop that runs the golden dataset against the production agent on a schedule and alerts when pass rates decline.

Online Evaluation with Drift Alert
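import random
from datetime import datetime, timezone

# GOLDEN_DATASET, BASELINE_PASS_RATE, EvalReport, run_eval_suite,
# compute_rolling_pass_rate, and alert are assumed to be defined elsewhere.

async def run_online_eval_sample(
    agent, sample_size: int = 50
) -> EvalReport:
    """Randomly samples from golden dataset and runs online evaluation."""
    # random.sample raises ValueError if asked for more items than exist
    sample_size = min(sample_size, len(GOLDEN_DATASET))
    sample   = random.sample(GOLDEN_DATASET, sample_size)
    results  = await run_eval_suite(agent, sample)

    report = EvalReport(
        timestamp    = datetime.now(timezone.utc),
        pass_rate    = results.pass_count / len(sample),
        failed_cases = results.failed_cases,
    )

    # Alert if the 7-day rolling pass rate falls more than 5 percentage
    # points below baseline (an absolute drop, not a relative one)
    rolling_rate = compute_rolling_pass_rate(window_days=7)
    if rolling_rate < BASELINE_PASS_RATE - 0.05:
        alert.send(
            severity = "HIGH",
            message  = f"Agent pass rate drifted: {rolling_rate:.1%} "
                       f"(baseline: {BASELINE_PASS_RATE:.1%})"
        )
    return report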
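
The listing above calls compute_rolling_pass_rate without showing it. One way it might work, sketched under assumptions: stored reports carry a timestamp plus passed and total counts, and the rolling rate pools all cases inside the window rather than averaging per-run rates, so small runs do not dominate. The StoredReport type, the explicit reports and now parameters, and drift_alert_needed are hypothetical and differ from the single-argument call in the original.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class StoredReport:
    timestamp: datetime   # timezone-aware UTC time of the eval run
    passed: int           # number of golden cases that passed
    total: int            # number of golden cases evaluated

def compute_rolling_pass_rate(
    reports: list[StoredReport],
    window_days: int = 7,
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Pooled pass rate over reports inside the window, or None if empty."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    recent = [r for r in reports if r.timestamp >= cutoff]
    total = sum(r.total for r in recent)
    if total == 0:
        return None
    return sum(r.passed for r in recent) / total

def drift_alert_needed(
    rolling_rate: Optional[float], baseline: float, max_drop: float = 0.05
) -> bool:
    """True when the rolling rate is more than max_drop (absolute) below baseline."""
    return rolling_rate is not None and rolling_rate < baseline - max_drop
```

Returning None for an empty window avoids raising a false alarm when the evaluation job itself has stopped running; that condition probably deserves its own alert.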

15.5   Agent Audit Trail

For production agentic systems, especially those taking real-world actions, a complete audit trail is not optional: in regulated settings it is often a compliance requirement. Every inference that leads to a real-world action must be traceable: who asked, what the agent was told, what it reasoned, which tools it called, and what it returned.

Audit Field | Description | Security Value
run_id | Unique identifier for this agent invocation | Links all events in one run
trace_id | Shared ID across all agents in a multi-agent workflow | Traces causality across agents
timestamp | UTC timestamp of the invocation | Enables time-based correlation
user_id | Authenticated user who triggered the invocation | Attribution for all actions
intent | Classified intent (from semantic router) | Audit trail for routing decisions
prompt_version | System prompt version identifier | Correlates behavior with specific prompts
model | Model name + version used | Correlates behavior with model versions
tool_calls | Full list of tool calls with inputs and outputs | Post-hoc detection of injection attacks
reasoning_trace | Extracted chain-of-thought reasoning | Root cause analysis for wrong output
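
The audit fields above map naturally onto a single record type. The sketch below is one possible shape, assuming records are serialized as JSON lines for an append-only log. The AuditRecord class, the to_json_line method, and the sample values are illustrative, and the storage format is an assumption rather than something the text prescribes.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any

def _utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

@dataclass(frozen=True)   # frozen: fields cannot be reassigned after creation
class AuditRecord:
    run_id: str
    trace_id: str
    user_id: str
    intent: str
    prompt_version: str
    model: str
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    reasoning_trace: str = ""
    timestamp: str = field(default_factory=_utc_now_iso)

    def to_json_line(self) -> str:
        """Serialize to one JSON line; non-JSON values fall back to str()."""
        return json.dumps(asdict(self), sort_keys=True, default=str)

# Hypothetical example record.
record = AuditRecord(
    run_id="run-001",
    trace_id="abc123",
    user_id="user-42",
    intent="refund_request",
    prompt_version="v3",
    model="example-model-2024-01",
    tool_calls=[{"tool": "sql_query", "input": {"order_id": 7}, "output": "OK"}],
    reasoning_trace="1. Customer asks for a refund...",
)
```

Note that frozen=True only blocks attribute reassignment; the list inside tool_calls is still mutable, so real tamper resistance has to come from the storage layer.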

15.6   Distributed Tracing in Multi-Agent Workflows

Fig 15.2 — Distributed Trace: Multi-Agent Workflow
trace_id: abc123 | request_id: req456 | user: bap@co | started: 09:42:11.234
span[orchestrator] duration=4821ms model=gpt-4o tokens=1240
span[rag.retrieve] duration=312ms query="Q3 risk" chunks_returned=8 top_score=0.91
span[tool.sql_query] duration=88ms rows=142 status=OK
span[agent.analysis] duration=2100ms model=claude-3 tokens=3400 loops=3
span[reflexion.retry] attempt=2 prev_score=2.8 reason="missing APAC breakdown"
span[agent.format] duration=321ms output_tokens=620 format=pdf_ready
trace completed | status=SUCCESS | total_tokens=5262 | total_cost=$0.034 | eval_score=4.2/5
Each span is emitted by instrumented agent code, enabling replay, cost attribution, and per-span latency analysis.

A distributed trace gives each agent call, tool invocation, and retrieval step its own span with timing, token count, and outcome. The full trace reconstructs exactly what happened, in what order, and at what cost.
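To make the span idea concrete, here is a minimal in-memory sketch of span emission. The Tracer class and its methods are hypothetical and written for illustration; a production system would more likely use an established tracing standard such as OpenTelemetry. Every span carries the shared trace_id, records its own duration and status, and can be enriched by the caller with attributes such as token counts.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Any, Iterator, Optional

class Tracer:
    """Hypothetical in-memory tracer: one trace_id shared by many spans."""

    def __init__(self, trace_id: Optional[str] = None) -> None:
        self.trace_id = trace_id or uuid.uuid4().hex
        self.spans: list[dict[str, Any]] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[dict[str, Any]]:
        """Time a block of work and record it as a span."""
        record: dict[str, Any] = {
            "trace_id": self.trace_id, "name": name, "status": "OK", **attributes
        }
        start = time.perf_counter()
        try:
            yield record                      # caller may add attributes
        except Exception as exc:
            record["status"] = "ERROR"
            record["error"] = repr(exc)
            raise
        finally:
            # Spans are appended when they finish, so inner spans come first
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)

    def total_tokens(self) -> int:
        """Sum the 'tokens' attribute over all finished spans."""
        return sum(s.get("tokens", 0) for s in self.spans)
```

A usage pattern might be `with tracer.span("agent.analysis", model="some-model") as s:` followed by `s["tokens"] = 3400` once the model call returns. Passing the same Tracer (or at least its trace_id) to every agent is what allows the spans to be stitched back into one trace.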

Core Principle — Chapter 15

You cannot debug what you cannot see. Observability is not a luxury — it is the prerequisite for production-grade agentic systems. Instrument reasoning traces from day one, before you need to debug a failure in production.