You Cannot Debug What You Cannot See
In traditional software, observability is about metrics, logs, and traces. You instrument the code, collect the data, and visualize what is happening inside a running system. This discipline is mature and well-understood. For agentic systems, the same discipline applies — but the thing you are observing is fundamentally different: you are observing reasoning. The agent's decision-making process, its tool call choices, its confidence, its interpretation of the system prompt. Observing that requires new instrumentation at the semantic level.
A traditional microservice either returns a 200 or a 500. You can trace the call stack. You can replay the exact request. Agent behavior has none of that determinism. The same input may produce different outputs on different runs. The "bug" may not be an exception — it may be subtle semantic drift: the agent technically answered the question, but the answer is slightly wrong, slightly off-tone, slightly too verbose. That kind of degradation is invisible without semantic observability tooling.
Multi-agent systems compound the difficulty. If a five-agent pipeline produces a wrong final output, which agent is responsible? Without distributed tracing that links all agents by a shared trace_id, you cannot answer that question.
The four-layer stack separates what the business cares about (Layer 4) from what the infrastructure measures (Layer 1). Alerts trigger at the business layer; root-cause analysis drills downward through each layer.
The reasoning trace is the most valuable and most unique layer of agentic observability. It captures not just what the agent said, but why it made specific choices. This requires explicit prompt instrumentation: instruct the agent to output its reasoning in a structured format that can be extracted and stored separately from the user-facing response.
SYSTEM_PROMPT_WITH_TRACE = """
You are a customer service agent. Before responding, reason through
the problem. Structure your complete output as:
<reasoning>
1. What is the customer asking? (classify intent)
2. What information do I need from tools?
3. What are the constraints from the system prompt?
4. What is the appropriate response format?
</reasoning>
<output>
[Your response to the customer here]
</output>
"""
def extract_reasoning_trace(full_response: str) -> tuple[str, str]:
"""Separates reasoning from user-facing output."""
reasoning_match = re.search(r'<reasoning>(.*?)</reasoning>',
full_response, re.DOTALL)
output_match = re.search(r'<output>(.*?)</output>',
full_response, re.DOTALL)
reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
output = output_match.group(1).strip() if output_match else full_response
return reasoning, output
Every tool invocation should be logged with the full input, the full output, the latency, and the trace ID. A TracedToolWrapper decorates each tool with automatic logging, without changing the tool's implementation.
class TracedToolWrapper:
def __init__(self, tool_fn, trace_store: TraceStore):
self.tool_fn = tool_fn
self.trace_store = trace_store
async def __call__(self, trace_id: str, **kwargs) -> Any:
start_time = time.time()
result = await self.tool_fn(**kwargs)
latency_ms = (time.time() - start_time) * 1000
self.trace_store.log_tool_call({
"trace_id": trace_id,
"tool": self.tool_fn.__name__,
"input": kwargs,
"output": result,
"latency_ms":latency_ms,
"timestamp": datetime.utcnow().isoformat(),
})
return result
Semantic drift is the gradual degradation of agent behavior over time — often invisible in individual interactions but detectable in aggregate. Causes: model provider updates to the underlying model version; incremental system prompt edits that collectively shift behavior; distribution shift in user queries (the real-world traffic diverges from the golden dataset). The fix is a continuous evaluation loop that runs the golden dataset against the production agent on a schedule and alerts when pass rates decline.
async def run_online_eval_sample(
agent, sample_size: int = 50
) -> EvalReport:
"""Randomly samples from golden dataset and runs online evaluation."""
sample = random.sample(GOLDEN_DATASET, sample_size)
results = await run_eval_suite(agent, sample)
report = EvalReport(
timestamp = datetime.utcnow(),
pass_rate = results.pass_count / sample_size,
failed_cases = results.failed_cases,
)
# Alert if rolling pass rate drops more than 5% from baseline
rolling_rate = compute_rolling_pass_rate(window_days=7)
if rolling_rate < BASELINE_PASS_RATE - 0.05:
alert.send(
severity = "HIGH",
message = f"Agent pass rate drifted: {rolling_rate:.1%} "
f"(baseline: {BASELINE_PASS_RATE:.1%})"
)
return report
For production agentic systems, especially those taking real-world actions, a complete audit trail is not optional — it is a compliance requirement. Every inference that leads to a real-world action must be traceable: who asked it, what the agent was told, what it reasoned, what it called, and what it returned.
| Audit Field | Description | Security Value |
|---|---|---|
run_id | Unique identifier for this agent invocation | Links all events in one run |
trace_id | Shared ID across all agents in a multi-agent workflow | Traces causality across agents |
timestamp | UTC timestamp of the invocation | Enables time-based correlation |
user_id | Authenticated user who triggered the invocation | Attribution for all actions |
intent | Classified intent (from semantic router) | Audit trail for routing decisions |
prompt_version | System prompt version identifier | Correlates behavior with specific prompts |
model | Model name + version used | Correlates behavior with model versions |
tool_calls | Full list of tool calls with inputs and outputs | Post-hoc detection of injection attacks |
reasoning_trace | Extracted chain-of-thought reasoning | Root cause analysis for wrong output |
A distributed trace gives each agent call, tool invocation, and retrieval step its own span with timing, token count, and outcome. The full trace reconstructs exactly what happened, in what order, and at what cost.
You cannot debug what you cannot see. Observability is not a luxury — it is the prerequisite for production-grade agentic systems. Instrument reasoning traces from day one, before you need to debug a failure in production.