Chapter 13

The Debugging Mindset

Debug the Prompt, Not the Model

Part V: The New Engineering Practices

When a traditional program produces wrong output, you debug the code. The compiler is not the problem — the code is. The same principle applies to agentic systems: when an agent produces wrong output, you debug the prompt. The model is not the problem — the prompt is.

This mindset shift is often the hardest adjustment for experienced engineers. The urge to blame the model is strong, especially when behavior is inconsistent. But inconsistency is also a prompt engineering problem: an underspecified prompt produces inconsistent outputs. The fix is a more precise prompt, not a different model.


13.1   The Compiler Doesn't Have Bugs

A veteran developer does not blame the compiler when code misbehaves. They inspect the code. The compiler's job is to translate code into instructions, and in practice it does that faithfully. If the output is wrong, the input (the code) is almost always what is wrong.

An LLM's job is to predict probable next tokens given its context, and it does that faithfully. If the output is wrong, one of four things is usually true: the prompt is wrong, the context is ambiguous, the constraints are missing, or the model was asked to do something beyond its actual capability. In the first three cases, fix the prompt. In the fourth, fix the architecture.

Traditional Debugging                      | Agentic Debugging
-------------------------------------------|------------------------------------------------
"There's a bug in the code"                | "There's an ambiguity in the prompt"
Read the stack trace                       | Read the reasoning trace
Check variable values at failure point     | Check injected context at failure point
Write a unit test that reproduces the bug  | Write an eval case that reproduces the failure
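
The last row of the comparison, an eval case that reproduces a failure, can be sketched in a few lines. This is a minimal illustration, not a prescribed framework: `EvalCase`, `run_eval`, and `stub_agent` are hypothetical names, and the stub stands in for whatever function actually calls your agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One reproducible failure: an input plus a check on the output."""
    name: str
    user_message: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval(case: EvalCase, run_agent: Callable[[str], str]) -> bool:
    """Run the agent on the case input and apply the case's check."""
    return case.check(run_agent(case.user_message))


# Hypothetical failure: the agent processed a refund it should have refused.
late_refund = EvalCase(
    name="refund_outside_policy_window",
    user_message="My order was delivered 45 days ago. I want a refund.",
    check=lambda output: "refund has been processed" not in output.lower(),
)


def stub_agent(message: str) -> str:
    """Stand-in for a real agent call (assumption: replace with your own)."""
    return "Your refund has been processed."


print(run_eval(late_refund, stub_agent))  # False: the case reproduces the failure
```

Once the prompt is fixed, the same case should return True, and it stays in the suite as a regression check.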

13.2   Five Failure Categories

Most agent misbehaviors fall into five categories. Each has a diagnostic signal and a fix:

1. Vague Instruction

The prompt says what to do but not how. The model fills the gap with its best guess — which may not match your intent.

Before (Vague)
Summarize the customer's issue.
After (Precise)
Summarize the customer's issue in exactly 2 sentences.
First sentence: state the category (refund / shipping / order status / other).
Second sentence: state the specific concern in neutral language.
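
A side benefit of the precise version is that its output becomes mechanically checkable. The sketch below is one possible check, under loud assumptions: the sentence splitter is deliberately naive (it breaks on `.`, `!`, or `?` followed by whitespace), the category match is a plain substring test, and `check_summary` is a hypothetical helper name.

```python
import re

CATEGORIES = ("refund", "shipping", "order status", "other")


def check_summary(summary: str) -> list[str]:
    """Return a list of problems; an empty list means the summary passes."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    problems = []
    if len(sentences) != 2:
        problems.append(f"expected 2 sentences, got {len(sentences)}")
    # Substring match only; a word like "another" would also match "other".
    if sentences and not any(c in sentences[0].lower() for c in CATEGORIES):
        problems.append("first sentence does not name a category")
    return problems


print(check_summary("This is a refund issue. The customer reports a damaged item."))  # []
```

The vague prompt offers nothing comparable to test against, which is part of why its failures are hard to pin down.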

2. Missing Constraint

The prompt specifies the happy path but does not specify what to do when conditions are not met. The model attempts to be helpful and exceeds its authority.

Before (Missing Constraint)
Process refund requests for orders within 30 days of delivery.
After (Constraint Added)
Process refund requests ONLY for orders delivered within the past 30 days.
If the order is older than 30 days: do NOT process the refund.
Tell the customer the policy window has expired.
Offer to escalate to a human agent if they wish to appeal.
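
Constraints like this one are best exercised at their boundaries. The following sketch generates delivery dates just inside and just outside the window for use as eval inputs. It assumes day 30 itself is still eligible, which is how the wording "older than 30 days" reads; `within_refund_window` and `boundary_cases` are hypothetical helper names.

```python
from datetime import date, timedelta

REFUND_WINDOW_DAYS = 30  # taken from the policy stated in the prompt


def within_refund_window(delivered: date, today: date) -> bool:
    """True if delivery happened within the past 30 days (day 30 inclusive)."""
    return 0 <= (today - delivered).days <= REFUND_WINDOW_DAYS


def boundary_cases(today: date) -> list[tuple[date, bool]]:
    """Delivery dates around the policy edge, paired with the expected decision."""
    return [
        (today - timedelta(days=29), True),
        (today - timedelta(days=30), True),   # assumption: day 30 is still eligible
        (today - timedelta(days=31), False),
    ]


for delivered, refund_ok in boundary_cases(date(2024, 1, 31)):
    print(delivered.isoformat(), "refund allowed" if refund_ok else "refund refused")
```

Each pair can be turned into an eval case: a customer message that mentions the delivery date, plus a check that the agent's decision matches the expected one.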

3. Conflicting Instructions

The prompt contains two instructions that can conflict in certain cases. The model resolves the conflict however it sees fit — which may not be what you intended.

Before (Conflicting)
Be concise. Provide a comprehensive explanation of all options available to the customer.
After (Priority Explicit)
Prioritize conciseness: default to 2–3 sentences.
If the customer asks for more detail or options, then provide a comprehensive explanation.

4. No Example Provided

The prompt describes the desired format in words, but the model still guesses at the specifics. Few-shot examples are worth more than paragraphs of description.

After (With Example)
Format your response exactly like this example:
STATUS: Shipped
ORDER: ORD-123
ETA: January 15, 2024
MESSAGE: Your order is on its way.
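
A fixed format like this can also be verified in code, so that format drift shows up as a failed eval rather than a surprise in production. A minimal sketch, assuming the four fields always appear once each and in the order shown; `parse_response` is a hypothetical helper, not part of any library.

```python
import re

EXPECTED_FIELDS = ("STATUS", "ORDER", "ETA", "MESSAGE")


def parse_response(text: str) -> dict[str, str]:
    """Parse 'FIELD: value' lines; raise ValueError if the format is off."""
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        match = re.fullmatch(r"([A-Z]+): (.+)", line.strip())
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        if match.group(1) in fields:
            raise ValueError(f"duplicate field: {match.group(1)}")
        fields[match.group(1)] = match.group(2)
    # Dicts preserve insertion order, so this also checks field order.
    if tuple(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected fields {EXPECTED_FIELDS}, got {tuple(fields)}")
    return fields


example = """STATUS: Shipped
ORDER: ORD-123
ETA: January 15, 2024
MESSAGE: Your order is on its way."""

print(parse_response(example)["ETA"])  # January 15, 2024
```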

5. Context Overload

The injected context is too long or too noisy. The relevant information is buried. The model attends to the wrong parts (Chapter 3: lost-in-the-middle problem) and produces an answer based on irrelevant context.

Fix: Prioritize Context
## RELEVANT ORDER DETAILS (use these for your response):
Order ID: ORD-789
Status: Processing
Estimated Ship Date: January 18, 2024

## BACKGROUND (lower priority — only reference if directly asked):
Customer history: 3 prior orders, all delivered successfully
Account tier: Standard
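
If the context is assembled programmatically, the prioritization can be built into the assembly step. The sketch below is one way to do that, assuming the relevant and background facts arrive as simple key/value pairs; `build_context` is a hypothetical name, and the header wording is illustrative.

```python
def build_context(relevant: dict[str, str], background: dict[str, str]) -> str:
    """Render context with the high-priority section first and clearly labeled."""
    lines = ["## RELEVANT ORDER DETAILS (use these for your response):"]
    lines += [f"{key}: {value}" for key, value in relevant.items()]
    lines.append("")
    lines.append("## BACKGROUND (lower priority; only reference if directly asked):")
    lines += [f"{key}: {value}" for key, value in background.items()]
    return "\n".join(lines)


context = build_context(
    relevant={
        "Order ID": "ORD-789",
        "Status": "Processing",
        "Estimated Ship Date": "January 18, 2024",
    },
    background={
        "Customer history": "3 prior orders, all delivered successfully",
        "Account tier": "Standard",
    },
)
print(context)
```

Keeping the split in one function means the decision about what counts as relevant is made once, in code you can test, rather than separately at each call site.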

13.3   The Debugging Process

Fig 13.1: Agentic Debugging Flowchart

Start: Agent produces wrong / unexpected output.

Is the failure reproducible?
  No: Run the case 5–10 times and measure the fail rate. If the rate is above 15%, treat it as a bug.
  Yes: Which Trilogy layer is failing?

Prompt (Prompt Debug)
  1. Isolate in playground
  2. Add explicit examples
  3. Add output schema
  4. Eval against golden set

LLM (LLM Debug)
  1. Try different model
  2. Adjust temperature
  3. Reduce context size
  4. Check token budget

Data (Data Debug)
  1. Print raw retrieval
  2. Check chunk overlap
  3. Verify embeddings
  4. Test with static context

Finish: Apply fix → run eval suite → confirm pass. Add the failing case to the regression set so it is never lost.

Agentic debugging starts with reproducibility, then isolates the failing Trilogy layer. The most important habit: every diagnosed failure gets added to the regression eval set so it can never silently return.
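
The reproducibility step (repeat the case 5–10 times, measure the fail rate, and treat anything above 15% as a bug) is simple to automate. A minimal sketch, assuming a hypothetical `run_once` callable that executes the case once and returns True on a pass; the demo uses a scripted sequence of outcomes in place of real agent calls.

```python
from typing import Callable


def fail_rate(run_once: Callable[[], bool], runs: int = 10) -> float:
    """Run the case `runs` times and return the fraction of runs that failed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if not run_once())
    return failures / runs


def is_bug(rate: float, threshold: float = 0.15) -> bool:
    """Apply the flowchart's rule: a fail rate above the threshold is a bug."""
    return rate > threshold


# Scripted outcomes standing in for ten real runs (True = pass, False = fail).
outcomes = iter([True, False, True, True, False, True, True, True, True, True])
rate = fail_rate(lambda: next(outcomes), runs=10)
print(rate, is_bug(rate))  # 0.2 True
```

With only 5–10 runs the estimate is coarse, so a rate near the threshold is worth re-measuring with more runs before drawing a conclusion.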


13.4   Prompt Versioning

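Prompts are code. Like code, they should be version-controlled, and every change should be traceable. A minimal prompt versioning practice keeps each prompt revision under version control and records, for every change, the hypothesis behind it and the eval result it produced.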
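
As one possible sketch of such a practice (an illustration under stated assumptions, not a reconstruction of any specific tool): each prompt revision is stored with the hypothesis that motivated it and the eval pass rate it achieved. `PromptVersion`, `decide`, and the sample pass rates are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    """One entry in the version log: the prompt, why it changed, how it scored."""
    version: str
    prompt: str
    hypothesis: str   # why this change is expected to help
    pass_rate: float  # fraction of eval cases passed (made-up values below)

    @property
    def digest(self) -> str:
        """Short content hash, so a log entry can be matched to exact prompt text."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


def decide(baseline: PromptVersion, candidate: PromptVersion) -> str:
    """Keep the candidate only if it measurably beats the baseline."""
    return "keep" if candidate.pass_rate > baseline.pass_rate else "revert"


v1 = PromptVersion("v1", "Summarize the customer's issue.", "baseline", 0.70)
v2 = PromptVersion(
    "v2",
    "Summarize the customer's issue in exactly 2 sentences.",
    "An explicit length limit removes rambling summaries.",
    0.90,
)

for entry in (v1, v2):
    print(entry.version, entry.digest, entry.pass_rate, entry.hypothesis)
print(decide(v1, v2))  # keep
```

In a real project the prompt text would usually live in files tracked by the same version control system as the code, with a log like this stored alongside it.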

This practice transforms prompt debugging from "trying random changes and hoping" into engineering: hypothesize, change, measure, decide. The eval is the instrument. The version log is the experiment record.


13.5   LLM as Debugging Partner

One of the most underused debugging techniques is to ask a separate LLM session to diagnose your prompt. Feed it the system prompt, the failing eval case, and the wrong output, and ask: "Why might this prompt produce this output? What ambiguity or missing instruction could cause this specific failure?" LLMs are often good at spotting ambiguity in language, because resolving ambiguous instructions is part of what they do on every inference. Use that capability deliberately, and treat the diagnosis as a hypothesis to confirm against your evals.

Debugging Prompt Template:

I am debugging an AI agent that produced unexpected output. Please identify what in the system prompt may have caused this behavior.

SYSTEM PROMPT: [paste prompt here]
USER MESSAGE: [paste failing input here]
ACTUAL OUTPUT: [paste wrong response here]
EXPECTED OUTPUT: [paste what it should have said here]

Identify the most likely cause from these categories: vague instruction / missing constraint / conflicting instructions / no example / context overload. Suggest a minimal targeted fix.
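
To use the template routinely, it helps to fill it from code so that every failing eval case produces a ready-to-send diagnostic prompt. A minimal sketch; `build_debug_prompt` is a hypothetical helper, and sending the result to a model is left to whatever client you already use.

```python
DEBUG_TEMPLATE = """\
I am debugging an AI agent that produced unexpected output. Please identify what in the system prompt may have caused this behavior.

SYSTEM PROMPT: {system_prompt}
USER MESSAGE: {user_message}
ACTUAL OUTPUT: {actual_output}
EXPECTED OUTPUT: {expected_output}

Identify the most likely cause from these categories: vague instruction / missing constraint / conflicting instructions / no example / context overload. Suggest a minimal targeted fix."""


def build_debug_prompt(
    system_prompt: str,
    user_message: str,
    actual_output: str,
    expected_output: str,
) -> str:
    """Fill the debugging template with one failing case."""
    # str.format only interprets braces in the template, not in the values,
    # so prompts that themselves contain { or } are inserted unchanged.
    return DEBUG_TEMPLATE.format(
        system_prompt=system_prompt,
        user_message=user_message,
        actual_output=actual_output,
        expected_output=expected_output,
    )


print(build_debug_prompt(
    system_prompt="Process refund requests for orders within 30 days of delivery.",
    user_message="My order arrived 45 days ago. Can I get a refund?",
    actual_output="Your refund has been processed.",
    expected_output="The policy window has expired. Would you like to escalate?",
))
```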

13.6   Blame Distribution Matrix

Failure Location                      | Frequency | Diagnostic Signal                                 | Fix
--------------------------------------|-----------|---------------------------------------------------|------------------------------------------------------
System prompt (instruction quality)   | ~60%      | Consistent wrong behavior on specific input type  | Add precision, constraints, or examples
Injected context (quality/relevance)  | ~20%      | Response references wrong order / wrong policy    | Improve retrieval, add context prioritization headers
Tool schema (description clarity)     | ~10%      | Agent calls wrong tool or wrong parameters        | Improve tool description and parameter naming
Model capability gap                  | ~8%       | Fails even with perfect prompt on every attempt   | Decompose task; change model; add reasoning step
Genuine model bug                     | ~2%       | Reproducible failure with no prompt explanation   | Report to model provider; use different model version

13.7   Chapter Summary

Debugging an agent is a disciplined process: isolate, classify, fix specifically, measure. The starting hypothesis is the prompt, not the model. By the estimates in the blame distribution matrix, roughly ninety percent of agent failures can be fixed without changing the model. The remaining ten percent or so require architectural changes or a different model, but you only know which category you are in after you have exhausted the prompt debugging process.

Core Principle — Chapter 13

When your agent misbehaves, you debug the prompt, not the model. The prompt is the program. Start there.