Chapter 3

The Context Window

RAM of the AI Age

Part II: The New Memory

In traditional computing, RAM — Random Access Memory — holds the current state of execution: fast, limited, and volatile. It disappears when the process ends and must be managed carefully to avoid waste or exhaustion.

In agentic systems, RAM is the context window: the total number of tokens the model can process in a single call. Everything the agent "knows" during any given inference pass must fit within this window. What falls outside it is, from the model's perspective, as if it never existed.


3.1   What is the Context Window?

The context window is the maximum number of tokens a model can handle in a single inference call, counting both the input it receives and the output it generates. A token is roughly 0.75 words in English, so a 128,000-token context window corresponds to approximately 96,000 words, about the length of a novel.
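The word-to-token arithmetic above can be sketched as a pair of helpers. This is a rough estimate only: the 0.75 ratio is the rule of thumb quoted in the text, real counts depend on the model's tokenizer, and the function names are illustrative.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English; the true ratio depends on the tokenizer


def tokens_to_words(tokens: int) -> int:
    """Approximate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


print(tokens_to_words(128_000))  # 96000
```

For budgeting in production, a count from the model's actual tokenizer is preferable to a word-based estimate like this one.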

Model Generation                        Approximate Context Limit
GPT-3 (2020)                            2,048 tokens
GPT-4 (2023)                            8,192 – 32,768 tokens
Claude 2 / GPT-4 Turbo (2023)           100,000 – 128,000 tokens
Gemini 1.5 Pro (2024)                   1,000,000 tokens
Gemini 2.0 / leading models (2025)      1,000,000 – 2,000,000 tokens
Fig 3.1 — Context Window as RAM: The Space Budget
Context Window: 128k tokens
System Prompt (~2k tokens), "The OS Layer": identity, constraints, tool definitions, output format. Fixed cost per call.
Retrieved Documents / Injected Data (~30k tokens), "The Heap": dynamically allocated per request; RAG chunks, tool outputs, external data. Largest and most variable region; design carefully.
Conversation History (~20k tokens), "The Stack": grows with each turn, NOT garbage-collected; prior user messages and prior agent responses. ⚠ Grows unbounded.
Tool Outputs / Intermediate Reasoning (~10k tokens), "The Registers": transient, consumed during computation. Transient per cycle.
Available buffer: headroom before truncation. Manage to preserve this.

The context window is not empty space — it is a budget with named regions. Every token allocation is an engineering decision. Over-allocating any region shrinks the buffer and risks truncation.

Why Volatile?

RAM contents are lost when a process terminates. Context window contents are lost in the same way: the model itself keeps nothing between calls, and a conversation appears continuous only because the application resends its history each time. The model has no memory of previous calls unless that history is explicitly included in the current prompt's context. Even as context windows grow to millions of tokens, the volatility remains: the model's "memory" exists only within the bounds of the active context.
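A minimal sketch of this volatility, assuming a stand-in `echo_model` function in place of a real model client (both `run_turn` and `echo_model` are hypothetical names used only for illustration). The model sees exactly the messages passed in a call and nothing else.

```python
from typing import Callable

Message = dict[str, str]


def run_turn(history: list[Message], user_text: str,
             model: Callable[[list[Message]], str]) -> list[Message]:
    """Send the full history plus the new message; return the extended history."""
    messages = history + [{"role": "user", "content": user_text}]
    reply = model(messages)  # the model sees only what is in `messages`
    return messages + [{"role": "assistant", "content": reply}]


def echo_model(messages: list[Message]) -> str:
    """Stand-in model: reports how many messages it was shown."""
    return f"I can see {len(messages)} message(s)."


history: list[Message] = []
history = run_turn(history, "My name is Ada.", echo_model)
history = run_turn(history, "What is my name?", echo_model)
print(history[-1]["content"])  # I can see 3 message(s).

# Dropping the history makes the earlier turn invisible to the model.
fresh = run_turn([], "What is my name?", echo_model)
print(fresh[-1]["content"])  # I can see 1 message(s).
```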


3.2   What Lives in the Context Window?

Not all tokens in a context window are created equal. Just as RAM is partitioned into stack, heap, static data, and kernel space, the context window has functional regions.

The System Prompt (The OS Layer). Defines identity, behavioral constraints, available tools, and output format requirements. It is always present, so every token in it is a fixed overhead cost on every invocation.

Conversation History (The Stack). The history of prior messages grows with each turn. Unlike a call stack, whose frames are released when functions return, conversation history is never reclaimed automatically. Left unmanaged, it will grow until it hits the context limit.

Retrieved Documents and Injected Data (The Heap). When an agent retrieves documents via RAG or receives structured data through tool calls, that data is injected into the context dynamically. This is typically the largest and most variable region of the context window.

Intermediate Reasoning (The Registers). When a model uses chain-of-thought reasoning or operates within a ReAct loop, it generates intermediate reasoning text. This text is transient: it is actively used during computation and does not need to be persisted once the computation completes.
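The four regions can be modeled as a small data structure that tracks usage and remaining headroom. This is an illustrative sketch: the class and field names are assumptions, and the token counts are the example figures from Fig 3.1.

```python
from dataclasses import dataclass


@dataclass
class ContextRegions:
    """Token counts per functional region (names follow this section)."""
    system_prompt: int
    history: int
    retrieved: int
    reasoning: int

    def used(self) -> int:
        """Total tokens consumed across all regions."""
        return self.system_prompt + self.history + self.retrieved + self.reasoning

    def headroom(self, window: int) -> int:
        """Tokens left before the window is full; negative means overflow."""
        return window - self.used()


regions = ContextRegions(system_prompt=2_000, history=20_000,
                         retrieved=30_000, reasoning=10_000)
print(regions.used(), regions.headroom(128_000))  # 62000 66000
```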


3.3   The "Lost in the Middle" Problem

In 2023, Liu et al. reported in "Lost in the Middle" that models perform markedly worse when critical information sits in the middle of a long context than when it sits at the beginning or end. This U-shaped ("bathtub") performance curve has direct practical implications for prompt design.

Fig 3.2 — The “Lost in the Middle” Effect
[Figure: U-shaped curve. Y-axis: attention weight (high, mid, low). X-axis: position in context window (start, middle, end). Annotations: ✓ system prompt at the start; ⚠ critical info in the middle is lost; ✓ recent context at the end.]

LLMs pay significantly less attention to content placed in the middle of a long context. Put critical instructions at the start of the system prompt or at the end of the user message — never buried in the middle of retrieved chunks.

Place the most critical instructions at the beginning of the system prompt. Do not bury "you must always respond in JSON" in paragraph 7 of a 20-paragraph system prompt. Place the most relevant data near the end of the context (just before the user query), where the model's attention is strongest. Avoid the middle for anything load-bearing.

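This placement advice is not a workaround for a passing bug. The effect has been observed across many transformer-based models, and although its severity varies by model, it is safest to treat it as a lasting characteristic of how attention behaves over long contexts.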
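The placement guidance in this section can be sketched as a prompt-assembly function. It is a minimal illustration, not a prescribed API: the function name, parameters, and the assumption that `ranked_docs` arrives sorted most-relevant first are all hypothetical.

```python
def assemble_prompt(critical_rules: list[str], background: list[str],
                    ranked_docs: list[str], user_query: str) -> str:
    """Order sections so load-bearing content sits at the edges of the context.

    ranked_docs is assumed to be sorted most-relevant first.
    """
    parts: list[str] = []
    parts.extend(critical_rules)         # start: hard constraints
    parts.extend(background)             # middle: nice-to-have material
    parts.extend(reversed(ranked_docs))  # most relevant document ends up last
    parts.append(user_query)             # end: the query itself
    return "\n\n".join(parts)


prompt = assemble_prompt(
    critical_rules=["Always respond in JSON."],
    background=["Style notes."],
    ranked_docs=["doc A (best match)", "doc B"],
    user_query="What is the refund policy?",
)
print(prompt)
```

Reversing the ranked list places the best match immediately before the user query, which is where the text above says the model attends most strongly.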


3.4   Context Window as a Budget

The appropriate mental model for the context window is not "a large space to fill" but "a limited budget to allocate." Every token added to the context has a direct financial cost, a cognitive cost (attention degrades at scale), and a latency cost.

Fig 3.3 — Explicit Token Budget Allocation
Total budget: 128,000 tokens

Region                        Allocation         Share
System prompt                 2k tokens          1.6%
Retrieved documents           30k tokens max     23.4%
Conversation history          20k tokens max     15.6%
Tool output                   10k tokens max     7.8%
Reserve / response buffer     ~66k tokens        51.6%

Budget principles:
① Set hard limits per region, and enforce them in the retrieval layer, not in the prompt.
② Reserve buffer for the model's output (~2× expected response length).
③ Conversation history is the most dangerous unbounded region; prune proactively.

A token budget is not about frugality — it is about predictability. Systems that enforce token region limits behave consistently; systems that do not degrade unpredictably as conversations grow.

Token Budget Configuration
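config = {
    "max_system_prompt_tokens": 3000,    # fixed overhead paid on every call
    "max_retrieved_docs_tokens": 15000,  # cap enforced in the retrieval layer
    "max_history_tokens": 8000,          # prune or summarize beyond this
    "max_output_tokens": 5000,           # reserved for the model's response
}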
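One way to enforce a region limit in the retrieval layer, rather than in the prompt, is to drop ranked chunks once the budget is spent. The sketch below is an assumption-laden illustration: `count_tokens` is a crude word-based stand-in for a real tokenizer, and `fit_to_budget` is a hypothetical helper name.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: about 0.75 words per token."""
    return round(len(text.split()) / 0.75)


def fit_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


budget = {"max_retrieved_docs_tokens": 15000}
# Three synthetic chunks of 6,000 words each (about 8,000 tokens apiece).
chunks = ["alpha " * 6000, "beta " * 6000, "gamma " * 6000]
selected = fit_to_budget(chunks, budget["max_retrieved_docs_tokens"])
print(len(selected))  # 1
```

Stopping at the first chunk that does not fit preserves ranking order; a variant could instead skip the oversized chunk and keep trying smaller ones.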

3.5   Long Context vs. Short Context Models

The case for long context: For tasks that genuinely require reasoning across a large body of material — auditing an entire codebase, reading a full financial filing — long context models allow the agent to hold everything in "working memory" simultaneously.

The case for short context + RAG: For tasks involving a large corpus where only a fraction is needed per query, RAG is more precise, more cost-efficient, and often more accurate. Retrieving the right 5 pages from a 500-page manual is faster, cheaper, and less likely to cause "lost in the middle" degradation than injecting all 500 pages.
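The 5-pages-versus-500-pages comparison can be made concrete with rough arithmetic. The figure of 500 words per page is an assumption chosen purely for illustration, and the 0.75 words-per-token ratio is the rule of thumb from Section 3.1.

```python
WORDS_PER_PAGE = 500     # assumption for illustration; real manuals vary
WORDS_PER_TOKEN = 0.75   # rule of thumb from Section 3.1


def pages_to_tokens(pages: int) -> int:
    """Estimate the token count for a number of pages."""
    return round(pages * WORDS_PER_PAGE / WORDS_PER_TOKEN)


full_manual = pages_to_tokens(500)
retrieved = pages_to_tokens(5)
print(full_manual, retrieved, full_manual // retrieved)  # 333333 3333 100
```

Under these assumptions the full manual would not even fit in a 128,000-token window, while the retrieved pages use about one hundredth of the tokens.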

The practical answer: Use the smallest context that reliably supports the task. Long context windows are a safety net — not a license to abandon context discipline.


3.6   Strategies for Context Management

Context management is the closest thing agentic programming has to memory management in traditional systems.

Chunking: Break large documents into smaller, semantically coherent pieces that can be independently retrieved. Chunk size is a tunable parameter — chunks too small lose context; chunks too large dilute precision.
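A simple chunker along these lines groups whole paragraphs until a size limit is reached. This is a minimal sketch that uses paragraph boundaries as a proxy for semantic coherence and counts words rather than tokens; the function name and the `max_words` parameter are illustrative.

```python
def chunk_paragraphs(text: str, max_words: int) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "one two three\n\nfour five\n\nsix seven eight nine"
print(chunk_paragraphs(doc, max_words=5))
```

Tuning `max_words` is the trade-off described above: smaller values give more precise retrieval but strip away surrounding context.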

Summarization Compression: When conversation history grows too long, run a summarization call that compresses 50 previous turns into a 300-token state block. The compressed block replaces the full history in subsequent calls.

Rolling Summary Window
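WINDOW_SIZE = 10       # number of oldest turns folded into each summary
SUMMARY_TRIGGER = 20   # summarize once history exceeds this many turns

def compact_history(conversation_history, llm):
    """Replace the oldest WINDOW_SIZE turns with a single summary message."""
    if len(conversation_history) <= SUMMARY_TRIGGER:
        return conversation_history
    oldest_turns = conversation_history[:WINDOW_SIZE]
    # Render turns as readable text instead of a raw list repr.
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in oldest_turns)
    # llm.call is a placeholder for whatever completion client is in use.
    summary = llm.call(f"Summarize these {len(oldest_turns)} turns:\n{transcript}")
    return (
        [{"role": "system", "content": f"[Summary]: {summary}"}]
        + conversation_history[WINDOW_SIZE:]
    )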

Context Refresh: When context rot is detected (Chapter 4), start a fresh session with a compact "state handoff" prompt that includes the task objective, decisions made, data gathered, and next steps.
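A state handoff of this kind might be represented as a small record that renders itself into a prompt. The sketch below is one possible shape, not a standard format: the class name, field names, and layout are assumptions that mirror the four items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class StateHandoff:
    """Compact state carried into a fresh session."""
    objective: str
    decisions: list[str] = field(default_factory=list)
    data: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the handoff as a short, sectioned prompt."""
        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"{title}:\n{body}"

        return "\n\n".join([
            f"Task objective: {self.objective}",
            section("Decisions made", self.decisions),
            section("Data gathered", self.data),
            section("Next steps", self.next_steps),
        ])


handoff = StateHandoff(
    objective="Migrate the billing database",
    decisions=["Use a blue-green cutover"],
    next_steps=["Write the rollback script"],
)
print(handoff.to_prompt())
```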

Selective Injection: The most disciplined approach — never inject data into the context "just in case." Inject only what is needed for this specific step of this specific task.
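Selective injection can be sketched as a filter in which each step declares the data it needs and receives nothing else. The function name and the dictionary-based data store are hypothetical, chosen only to illustrate the idea.

```python
def select_for_step(available: dict[str, str], needed_keys: list[str]) -> dict[str, str]:
    """Return only the data items this step declares it needs."""
    missing = [k for k in needed_keys if k not in available]
    if missing:
        raise KeyError(f"step requires data that was not gathered: {missing}")
    return {k: available[k] for k in needed_keys}


gathered = {
    "customer_profile": "Plan: Pro, region: EU",
    "invoice_history": "12 invoices, 1 overdue",
    "support_tickets": "3 open tickets",
}
# A billing step asks for the invoice data only; the rest stays out of the context.
injected = select_for_step(gathered, ["invoice_history"])
print(injected)  # {'invoice_history': '12 invoices, 1 overdue'}
```

Failing loudly on a missing key makes each step's data requirements explicit instead of silently proceeding with partial context.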


3.7   Chapter Summary

The context window is to agentic programming what RAM is to traditional computing: a fast, finite, volatile working space that shapes everything about how computation proceeds. Larger context windows reduce the urgency of context management — but they do not eliminate it. Attention quality, cost, and latency still scale with context size.

Core Principle — Chapter 3

The context window is not a dumping ground — it is a budget. Every token you add to it is a choice. Every token you leave out is also a choice. Great agentic programming is, in large part, great context management.