Debug the Prompt, Not the Model
When a traditional program produces wrong output, you debug the code. The compiler is not the problem — the code is. The same principle applies to agentic systems: when an agent produces wrong output, you debug the prompt. The model is not the problem — the prompt is.
This mindset shift is often the hardest adjustment for experienced engineers. The urge to blame the model is strong, especially when behavior is inconsistent. But inconsistency is also a prompt engineering problem: an underspecified prompt produces inconsistent outputs. The fix is a more precise prompt, not a different model.
A veteran developer does not blame the compiler when code does not behave as expected. They inspect the code. The compiler's job is to translate code into instructions, and it does that faithfully. If the output is wrong, the input (the code) is wrong.
An LLM's job is to predict the most probable next token given context — and it does that faithfully. If the output is wrong, the input (the prompt) is wrong — or the context is ambiguous, or the constraints are missing, or the model was asked to do something beyond its actual capability. In the first three cases, fix the prompt. In the fourth, fix the architecture.
| Traditional Debugging | Agentic Debugging |
|---|---|
| "There's a bug in the code" | "There's an ambiguity in the prompt" |
| Read the stack trace | Read the reasoning trace |
| Check variable values at failure point | Check injected context at failure point |
| Write a unit test that reproduces the bug | Write an eval case that reproduces the failure |
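The last row of the table can be made concrete with a small harness. The following is a minimal sketch, not a prescribed framework: `EvalCase`, `run_eval`, and `fake_agent` are hypothetical names, and `fake_agent` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One reproducible failure: an input plus a check on the agent's output."""
    name: str
    user_input: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and record pass/fail by case name."""
    return {case.name: case.check(agent(case.user_input)) for case in cases}


# Hypothetical stand-in for a real agent call (assumption for illustration).
def fake_agent(user_input: str) -> str:
    return "Refund processed." if "refund" in user_input.lower() else "How can I help?"


cases = [
    EvalCase(
        name="refund_outside_window",
        user_input="I want a refund for an order delivered 45 days ago.",
        check=lambda out: "processed" not in out.lower(),
    ),
]

results = run_eval(fake_agent, cases)
print(results)  # {'refund_outside_window': False}
```

The case fails against the stand-in agent, which is the point: like a unit test written before the fix, it should stay red until the prompt change makes it pass.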
Most agent misbehaviors fall into five categories. Each has a diagnostic signal and a fix:
The prompt says what to do but not how. The model fills the gap with its best guess — which may not match your intent.
Before:
Summarize the customer's issue.
After:
Summarize the customer's issue in exactly 2 sentences.
First sentence: state the category (refund / shipping / order status / other).
Second sentence: state the specific concern in neutral language.
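A precise instruction is also easier to check mechanically. The sketch below, under the assumption that a naive sentence splitter is good enough for an eval, tests whether a summary follows the two-sentence specification. The names `split_sentences` and `summary_matches_spec` are illustrative, not part of any library.

```python
import re

# Categories taken from the prompt text.
CATEGORIES = ("refund", "shipping", "order status", "other")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: breaks after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def summary_matches_spec(summary: str) -> bool:
    """True if the summary has exactly 2 sentences and the first names a category.

    Substring matching is a simplification: 'other' would also match 'another'.
    """
    sentences = split_sentences(summary)
    if len(sentences) != 2:
        return False
    first = sentences[0].lower()
    return any(category in first for category in CATEGORIES)


good = "Category: refund. The customer was charged twice for one order."
bad = "The customer is upset."
print(summary_matches_spec(good), summary_matches_spec(bad))
```

A check like this can serve as the `check` function of an eval case; the vague version of the prompt gives nothing comparable to test against.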
The prompt specifies the happy path but does not specify what to do when conditions are not met. The model attempts to be helpful and exceeds its authority.
Before:
Process refund requests for orders within 30 days of delivery.
After:
Process refund requests ONLY for orders delivered within the past 30 days.
If the order is older than 30 days: do NOT process the refund.
Tell the customer the policy window has expired.
Offer to escalate to a human agent if they wish to appeal.
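Boundary conditions like this one are natural eval targets. The sketch below generates the expected behavior on either side of the 30-day line. It assumes day 30 itself still counts as inside the window, which the prompt wording leaves open; `expected_action` and the action labels are hypothetical names.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # policy window stated in the prompt


def expected_action(delivered_on: date, today: date) -> str:
    """Expected agent behavior for an order delivered on `delivered_on`.

    Assumption: an order delivered exactly 30 days ago is still eligible.
    """
    age_in_days = (today - delivered_on).days
    if 0 <= age_in_days <= WINDOW_DAYS:
        return "process_refund"
    return "decline_and_offer_escalation"


today = date(2024, 1, 31)
boundary_cases = {
    age: expected_action(today - timedelta(days=age), today)
    for age in (29, 30, 31)
}
print(boundary_cases)
```

Writing the expected behavior down this way tends to surface questions the prompt has not answered, such as whether day 30 is inside or outside the window.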
The prompt contains two instructions that can conflict in certain cases. The model resolves the conflict however it sees fit — which may not be what you intended.
Before:
Be concise. Provide a comprehensive explanation of all options available to the customer.
After:
Prioritize conciseness: default to 2–3 sentences.
If the customer asks for more detail or options, then provide a comprehensive explanation.
The prompt describes the desired format in words, but the model still guesses at the specifics. Few-shot examples are worth more than paragraphs of description.
Format your response exactly like this example:
STATUS: Shipped
ORDER: ORD-123
ETA: January 15, 2024
MESSAGE: Your order is on its way.
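Once the format is pinned down by an example, conformance can be checked in code. This is a minimal sketch assuming the four fields must appear once each, in the order shown; `parse_response` and `REQUIRED_FIELDS` are hypothetical names.

```python
REQUIRED_FIELDS = ("STATUS", "ORDER", "ETA", "MESSAGE")


def parse_response(text: str) -> dict[str, str]:
    """Parse 'KEY: value' lines into a dict; raise ValueError if the layout differs."""
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"line is not in 'KEY: value' form: {line!r}")
        fields[key.strip()] = value.strip()
    if tuple(fields) != REQUIRED_FIELDS:
        raise ValueError(f"expected fields {REQUIRED_FIELDS}, got {tuple(fields)}")
    return fields


example = """STATUS: Shipped
ORDER: ORD-123
ETA: January 15, 2024
MESSAGE: Your order is on its way."""

parsed = parse_response(example)
print(parsed["STATUS"], parsed["ETA"])
```

A response that drifts from the layout raises `ValueError`, which makes format failures easy to count in an eval run.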
The injected context is too long or too noisy. The relevant information is buried. The model attends to the wrong parts (Chapter 3: lost-in-the-middle problem) and produces an answer based on irrelevant context.
## RELEVANT ORDER DETAILS (use these for your response):
Order ID: ORD-789
Status: Processing
Estimated Ship Date: January 18, 2024
## BACKGROUND (lower priority — only reference if directly asked):
Customer history: 3 prior orders, all delivered successfully
Account tier: Standard
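The prioritized layout above can be produced by a small context builder rather than assembled by hand. The sketch below is one possible shape; `build_context` is a hypothetical helper, and the header wording is adapted from the example.

```python
def build_context(relevant: dict[str, str], background: dict[str, str]) -> str:
    """Render context with the fields needed for the answer first and clearly labeled."""
    lines = ["## RELEVANT ORDER DETAILS (use these for your response):"]
    lines += [f"{key}: {value}" for key, value in relevant.items()]
    lines.append("## BACKGROUND (lower priority; only reference if directly asked):")
    lines += [f"{key}: {value}" for key, value in background.items()]
    return "\n".join(lines)


context = build_context(
    relevant={
        "Order ID": "ORD-789",
        "Status": "Processing",
        "Estimated Ship Date": "January 18, 2024",
    },
    background={
        "Customer history": "3 prior orders, all delivered successfully",
        "Account tier": "Standard",
    },
)
print(context)
```

Keeping the split between relevant and background fields in code means the decision about what the model should attend to is made explicitly, once, instead of being left to whatever the retrieval step happens to return.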
Agentic debugging starts with reproducibility, then isolates the failing Trilogy layer. The most important habit: every diagnosed failure gets added to the regression eval set so it can never silently return.
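The habit of adding every diagnosed failure to the regression set can be as simple as appending to a file. The following sketch assumes a JSONL file with one case per line and an `id` field per case; the file layout and the `add_regression_case` helper are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def add_regression_case(path: Path, case: dict) -> int:
    """Append a diagnosed failure to a JSONL eval set, skipping duplicate ids.

    Returns the number of cases in the set afterwards.
    """
    existing = []
    if path.exists():
        existing = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    if all(stored["id"] != case["id"] for stored in existing):
        existing.append(case)
        path.write_text("".join(json.dumps(stored) + "\n" for stored in existing))
    return len(existing)


# Usage in a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    eval_set = Path(tmp) / "regression.jsonl"
    case = {
        "id": "refund_outside_window",
        "input": "I want a refund for an order delivered 45 days ago.",
        "expected": "decline and offer escalation",
    }
    count_first = add_regression_case(eval_set, case)
    count_repeat = add_regression_case(eval_set, case)  # same id, not added twice
    print(count_first, count_repeat)
```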
Prompts are code. Like code, they should be version-controlled, and changes should be traceable. A minimal prompt versioning practice:
SYSTEM_PROMPT_V3_2
This practice transforms prompt debugging from "trying random changes and hoping" into engineering: hypothesize, change, measure, decide. The eval is the instrument. The version log is the experiment record.
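One possible shape for such a version log is sketched below. Everything in it is an assumption made for illustration: the `PromptVersion` record, the `should_promote` helper, the prompt texts, and in particular the pass rates, which are placeholder numbers rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str
    text: str
    change: str       # what changed and why (the hypothesis being tested)
    pass_rate: float  # eval pass rate recorded for this version


# Placeholder history; the pass rates are invented for the sketch.
HISTORY = [
    PromptVersion(
        version="3.1",
        text="Summarize the customer's issue.",
        change="baseline",
        pass_rate=0.70,
    ),
    PromptVersion(
        version="3.2",
        text="Summarize the customer's issue in exactly 2 sentences.",
        change="constrain length to fix rambling summaries",
        pass_rate=0.90,
    ),
]


def should_promote(old: PromptVersion, new: PromptVersion, min_gain: float = 0.02) -> bool:
    """Decide step: promote only if the eval pass rate improved by at least min_gain."""
    return new.pass_rate - old.pass_rate >= min_gain


print(should_promote(HISTORY[0], HISTORY[1]))
```

Each entry pairs a change with its hypothesis and its measured result, so a later regression can be traced to the version that introduced it.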
One of the most underused debugging techniques: ask a different LLM session to diagnose your prompt. Feed it the system prompt, the failing eval case, and the wrong output, and ask: "Why might this prompt produce this output? What ambiguity or missing instruction could cause this specific failure?" LLMs are good at identifying ambiguity in language — they do it by default as part of their inference. Use that capability deliberately.
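Assembling the diagnosis request is plain string construction. The sketch below builds the message from the three pieces named above; `build_diagnosis_prompt` and the section labels are hypothetical, and sending the result to a model is left out because it depends on the client in use.

```python
def build_diagnosis_prompt(system_prompt: str, failing_input: str, wrong_output: str) -> str:
    """Combine the prompt under test, the failing case, and the bad output into one request."""
    return "\n\n".join(
        [
            "You are reviewing a system prompt that produced a wrong output.",
            f"SYSTEM PROMPT:\n{system_prompt}",
            f"FAILING INPUT:\n{failing_input}",
            f"WRONG OUTPUT:\n{wrong_output}",
            "Why might this prompt produce this output? What ambiguity or "
            "missing instruction could cause this specific failure?",
        ]
    )


diagnosis_request = build_diagnosis_prompt(
    system_prompt="Process refund requests for orders within 30 days of delivery.",
    failing_input="I want a refund for an order delivered 45 days ago.",
    wrong_output="Refund processed.",
)
print(diagnosis_request)
```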
| Failure Location | Frequency | Diagnostic Signal | Fix |
|---|---|---|---|
| System prompt (instruction quality) | ~60% | Consistent wrong behavior on specific input type | Add precision, constraints, or examples |
| Injected context (quality/relevance) | ~20% | Response references wrong order / wrong policy | Improve retrieval, add context prioritization headers |
| Tool schema (description clarity) | ~10% | Agent calls wrong tool or wrong parameters | Improve tool description and parameter naming |
| Model capability gap | ~8% | Fails on every attempt, even with a well-specified prompt | Decompose task; change model; add reasoning step |
| Genuine model bug | ~2% | Reproducible failure with no prompt explanation | Report to model provider; use different model version |
Debugging an agent is a disciplined process: isolate, classify, fix specifically, measure. The model is not the starting hypothesis; the prompt is. Roughly ninety percent of agent failures (the system prompt, injected context, and tool schema rows in the table above) can be fixed without changing the model. The remaining ten percent or so require architectural changes or a different model, but you only know which category you are in after you have exhausted the prompt debugging process.
When your agent misbehaves, you debug the prompt, not the model. The prompt is the program. Start there.