Skip to content

Token Optimization & Context Handling

Strategies for managing token budgets, compressing context, and handing off state between agents in a multi-agent system. Companion to MODEL_OPTIMIZATION.md (model selection) and ../architecture/multi-agent-orchestration-patterns.md (branch coordination).

Tracking: Issue #42

GHCP-specific: This guidance was developed and tested against GitHub Copilot (GHCP). If you are using Azure OpenAI, Anthropic API, AWS Bedrock, or another provider, model names, tier pricing, and rate limits will differ. See Adapting for Other Providers.


1. Context Window Management Strategies

Every model has a finite context window. Treating it as unlimited leads to degraded output quality, truncated responses, and wasted spend.

Know Your Limits

Model Context Window Effective Limit Notes
claude-opus-4.6 200K tokens ~160K usable Reserve 20% for output generation
claude-sonnet-4.6 200K tokens ~160K usable Same reservation applies
gpt-5.3-codex 200K tokens ~160K usable Code-optimized; large inputs degrade non-code reasoning
claude-haiku-4.5 200K tokens ~160K usable Fast but quality drops sharply past ~80K input
gpt-5.4-mini 128K tokens ~100K usable Budget model; keep inputs under 60K for reliable output

Rule of thumb: Reserve 20% of the context window for output. If you need 40K tokens of output, you have 160K minus 40K = 120K for input.

Layered Context Loading

Load context in priority order rather than dumping everything at once:

  1. System instructions — governance rules, role definition (~1–2K tokens)
  2. Task-specific instructions — the immediate objective (~0.5–1K tokens)
  3. Critical reference files — only files the agent must read to complete the task
  4. Supporting context — examples, history, related docs (load only if budget remains)

Stop loading when you reach 60% of the effective limit. The remaining 40% covers agent reasoning and output.


2. Token Budget Allocation per Agent Role

Align token budgets with the model tier from MODEL_OPTIMIZATION.md:

Agent Role Model Tier Recommended Input Budget Max Output Budget Rationale
architect Premium 80K 40K Deep reasoning needs room for chain-of-thought
security_analyst Premium 80K 40K Must analyze full code paths without truncation
reviewer / code-review Reasoning 60K 20K Diffs + context; output is structured comments
researcher Reasoning 80K 30K May ingest large docs; output is synthesis
backend-dev / frontend-dev Code 60K 40K Code generation needs generous output budget
sprint-planner Reasoning 40K 20K Structured decomposition, not large inputs
merge-coordinator Fast 20K 10K Small diffs, merge commands, status checks
config-auditor / watchdog Fast 30K 5K Scan-and-report; output is a short checklist

Budget Enforcement Pattern

function enforceTokenBudget(prompt, maxInputTokens) {
  const estimated = estimateTokens(prompt);
  if (estimated > maxInputTokens) {
    return compressContext(prompt, maxInputTokens);
  }
  return prompt;
}

3. Prompt Compression Techniques

When context exceeds the budget, compress — don't truncate blindly.

3.1 Summarization

Replace large blocks with concise summaries. Use a fast-tier model (Haiku) to generate summaries before passing them to a higher-tier model.

Before (2,400 tokens):
  Full git diff of 15 files with 200 changed lines

After (400 tokens):
  "Summary: 15 files changed across src/auth/ and src/api/.
   Key changes: JWT validation refactored, rate-limit middleware added,
   3 new API routes (/users, /sessions, /tokens). No deletions."

Cost: One Haiku summarization call (~0.02¢) can save 2,000 tokens on a Sonnet call (~0.6¢).

3.2 Selective Inclusion

Only include what the agent needs for its specific task:

Agent Task Include Exclude
Code review Changed files, diff hunks, test results Unrelated source files, full history
Architecture Interface definitions, dependency graph Implementation bodies, test fixtures
Security scan Auth code, input handling, config UI components, styling, docs
Sprint planning Issue list, velocity data, blockers Source code, CI logs

3.3 Truncation with Markers

When you must truncate, leave breadcrumbs so the agent knows context was removed:

[FILE: src/auth/jwt.ts — 340 lines, showing lines 1-50 and 280-340]
[TRUNCATED: lines 51-279 contain helper functions — request full file if needed]

3.4 Reference by Pointer

Instead of inlining large files, reference them:

For the full API schema, see: docs/api-schema.yaml (420 lines, ~8K tokens)
Key endpoints relevant to this task: POST /auth/login, DELETE /auth/session

4. Context Handoff Between Agents

When one agent's output becomes another agent's input, transfer only what matters.

What to Pass

Data Pass? Format
Task result / deliverable ✅ Always Full output
Decision rationale ✅ Always 2–3 sentence summary
Files created or modified ✅ Always File paths + brief description
Unresolved issues or blockers ✅ Always Structured list
Error messages encountered ✅ If relevant Exact error text
Full conversation history ❌ Never Summarize instead
Intermediate reasoning steps ❌ Never Only pass conclusions
Unchanged reference files ❌ Never Agent can load its own

Handoff Template

## Agent Handoff: {source-role} → {target-role}

### Completed
- {what was done, 1-2 lines each}

### Artifacts
- `path/to/file.ts` — {what it contains}
- `path/to/test.ts` — {test coverage summary}

### Decisions Made
- {decision}: {rationale in one line}

### Open Items
- {anything the next agent must address}

### Context Files Needed
- {only files the next agent should load}

Target size: Handoff documents should be 500–1,500 tokens. If yours exceeds 2K tokens, compress further.

Pipeline Example

Architect (Opus, ~80K input)
  → produces: design doc + handoff (1.2K tokens)

Backend-Dev (Codex, ~60K input)
  → receives: handoff + design doc + relevant source files
  → produces: implementation + handoff (800 tokens)

Reviewer (Sonnet, ~60K input)
  → receives: handoff + diff + test results
  → produces: review comments + approval/rejection

5. Caching Strategies for Repeated Context

Avoid re-reading and re-tokenizing the same content across agent invocations.

System Prompt Caching

Most providers cache system prompts across calls with identical prefixes. Structure prompts so the stable prefix (governance rules, role definition) stays constant:

[CACHED — identical across all calls for this agent role]
  System instructions (governance.instructions.md)
  Role definition (agent .md file)

[VARIABLE — changes per invocation]
  Task-specific context
  File contents
  Conversation history

Savings: Anthropic's prompt caching charges ~10% of normal input cost for cached tokens. A 3K-token system prompt called 50 times saves ~135K billable tokens.

Cross-Agent Shared Context

When multiple agents need the same reference (e.g., a project spec), load it once and pass a summary to subsequent agents rather than having each agent re-read the full document.

Stale Context Invalidation

Cache keys should include: - File content hash (not just path — files change) - Branch name (context differs across branches) - Timestamp with TTL (default: 15 minutes for active sprint work)


6. Measurement and Monitoring

Token Usage Tracking

Track per-agent, per-invocation:

Metric How to Capture Why It Matters
Input tokens Provider API response Cost attribution
Output tokens Provider API response Cost attribution + quality signal
Cache hit tokens Provider API response (if available) Caching effectiveness
Context utilization Input tokens ÷ effective window Over 80% = risk of quality degradation
Compression ratio Original tokens ÷ compressed tokens Compression effectiveness

Cost Estimation Formula

Per-invocation cost =
  (input_tokens × input_price_per_1M / 1,000,000)
  + (output_tokens × output_price_per_1M / 1,000,000)
  - (cached_tokens × cache_discount_per_1M / 1,000,000)

Sprint-Level Budget

Estimate total sprint token usage:

Sprint budget = Σ (agent_invocations × avg_tokens_per_invocation × cost_per_token)

Example for a 10-issue sprint: - 10 planning calls (Sonnet, ~40K input, ~10K output) ≈ 500K tokens - 30 implementation calls (Codex, ~60K input, ~30K output) ≈ 2.7M tokens - 20 review calls (Sonnet, ~40K input, ~10K output) ≈ 1M tokens - 40 automation calls (Haiku, ~20K input, ~5K output) ≈ 1M tokens - Total: ~5.2M tokens per sprint

Alerts

Set thresholds to catch runaway usage: - Single invocation exceeds 150K input tokens → warn - Agent role exceeds 2× its average daily usage → investigate - Sprint total exceeds budget by 20% → pause and review


7. Instruction File Sizing

Instruction files (.instructions.md, .agent.md) are loaded into every invocation. Oversized instructions waste budget on every call.

File Type Target Size Max Size Tokens (est.)
.instructions.md 1–3 KB 5 KB ~500–1,500
.agent.md 2–4 KB 6 KB ~800–2,000
Governance instructions 3–5 KB 8 KB ~1,200–2,500

Sizing Guidelines

  • One concern per file. Split multi-topic instructions into separate files.
  • Link, don't inline. Reference detailed docs by path instead of copying content.
  • Prune examples. One good example beats three redundant ones.
  • Audit quarterly. Remove outdated rules that no longer apply.

8. Practical Examples with Token Counts

Example A: Code Review — Well-Optimized

System prompt (governance + reviewer role):     1,800 tokens
Task instructions:                                 400 tokens
Diff (3 files, 120 lines changed):              2,200 tokens
Test results summary:                              300 tokens
───────────────────────────────────────────────
Total input:                                     4,700 tokens
Output (review comments):                        1,200 tokens
Model: claude-sonnet-4.6
Estimated cost:                                   ~$0.02

Example B: Code Review — Unoptimized

System prompt (full governance + all instructions): 4,500 tokens
Full conversation history (15 turns):              12,000 tokens
All source files in the repo:                      45,000 tokens
Full CI log output:                                 8,000 tokens
───────────────────────────────────────────────
Total input:                                      69,500 tokens
Output (same review comments):                     1,200 tokens
Model: claude-sonnet-4.6
Estimated cost:                                    ~$0.23

Savings: 91% cost reduction by loading only what the agent needs.

Example C: Architect → Backend-Dev Handoff

Architect output (full):                          8,000 tokens
Handoff document (compressed):                    1,200 tokens
Backend-dev loads: handoff + 3 source files:      6,400 tokens
───────────────────────────────────────────────
Backend-dev total input:                          7,600 tokens
Without handoff (re-reads everything):           35,000 tokens
Savings:                                            78%

9. Task Complexity, Turn Budgets, and Adaptive Learning

Agent work has a cost that goes beyond tokens — it also consumes turns. Turns compound: each one adds history to the context window, narrows decision space, and raises the cost of a wrong approach. This section defines how to classify tasks, budget turns, and convert experience into reusable memory.

9.1 Complexity Classification

Classify each task before starting. State the class in your plan or at the top of your first response.

Class Definition Typical indicators Soft turn budget
Routine Pattern fully covered by existing instructions or memory "We have an instruction for this", existing template matches ≤ 3 turns
Familiar Similar to prior work but with new variables or constraints "We've done something like this", partial memory match ≤ 5 turns
Novel No prior coverage; encountering this pattern for the first time No matching instruction, no memory hit, new tool/API/pattern Estimate N turns upfront

For Novel tasks, state the estimated turn cost at task start:

Task: Integrate Azure Service Connector with App Configuration
Class: Novel — no prior Service Connector instruction exists
Estimated turns: 6 (learning cost: ~3 for discovery, ~3 for implementation + validation)

Learning cost amortization: A Novel task that succeeds becomes Familiar the next time. Familiar tasks that are done repeatedly become Routine. Memory is the mechanism — see §9.3.

9.2 Failure Protocol — Stuck After 5 Turns

If a task has consumed more than 5 turns and there has been no measurable forward progress, stop. Do not continue with more of the same approach.

What counts as forward progress?

Signal Counts?
A new test passes that did not pass before ✅ Yes
A new error class is resolved (not just a different message for the same root cause) ✅ Yes
A file reaches its intended target state ✅ Yes
A blocker is identified and removed ✅ Yes
The same error appears with different output ❌ No
More lines of code added without test validation ❌ No
Escalating to a higher-tier model without changing the approach ❌ No

When stuck:

  1. Log to memory. Call store_memory with:
  2. Task description (what was being attempted)
  3. Approach tried (summarize the strategy, not the code)
  4. Failure mode (exact error, logical dead-end, or missing capability)
  5. Blocking signal (what specifically is preventing progress)

  6. Reassess, don't escalate. Change the approach before changing the model tier. Common pivots:

  7. Break the task into smaller independently-verifiable units
  8. Load docs or examples that were previously skipped to save tokens
  9. Invert the approach (bottom-up instead of top-down, or vice versa)
  10. Consult memory for similar prior failures

  11. Escalate model tier only as a last resort, and only when the problem genuinely requires deeper reasoning — not just more attempts.

Early warning: the 80/50 rule

At 80% of your turn budget, if you have less than 50% progress toward the goal, pause and reassess. Don't wait until fully stuck.

Turn budget: 5
Current turn: 4 (80%)
Progress: 1 test passing of 4 required (25%)
→ Pause. Reassess before turn 5.

9.3 Success Protocol — Learning Reinforcement

When a task completes within its turn budget and test validation passes, evaluate whether the solution involved a non-obvious pattern.

Reinforce when: - The solution required a discovery that no existing instruction covers - The approach was non-obvious and could save future agents 2+ turns - A specific error was encountered and resolved in a reusable way

Skip when: - The task followed an existing instruction exactly - The solution is boilerplate (adding a route, writing a standard test, etc.) - The memory fact would duplicate existing instruction content

What to store:

Subject:    <topic area>
Fact:       <pattern or rule, ≤200 chars>
Citations:  <file:line or "User input: ..." or "Validated in task X">
Reason:     <why this will help future tasks; what turns it saves>

Good example:

Subject:    gh-aw compilation
Fact:       gh aw safe-outputs: add-labels and add-comment take no sub-properties.
            allowed-labels belongs under create-issue, not add-labels.
Citations:  .github/workflows/issue-triage.md — Sprint 11 compile iteration
Reason:     This error burned 2 turns on Sprint 11. Any future agentic workflow
            author will hit it immediately without this memory.

Bad example (too generic — skip):

Fact: Use TypeScript for new files.
→ Already in instructions. Skip.

9.4 Turn Accounting Summary

Before task:
  Classify: Routine / Familiar / Novel
  If Novel: state estimated turn budget

During task:
  Track actual turns vs. budget
  At 80% budget / <50% progress: reassess

On failure (>5 turns, no progress):
  store_memory(failure pattern)
  change approach or escalate

On success (within budget + tests pass):
  if novel pattern: store_memory(solution pattern)
  classification → drops one level (Novel→Familiar, Familiar→Routine)

10. From the Field: BaseCoat Sprint Experience

These examples come from BaseCoat's own development sprints. They illustrate how the strategies in §1–9 play out in practice.

Context Loading: The Cost of Speculative Reads

What happened: In early sprints, the explore skill pattern was: read the entire agents/ directory to understand the repo state. At 77 agents (~400 tokens each), this consumed ~30K tokens before any task work started.

What changed: The hot-index pattern. The L2 memory index (memory-index.instructions.md) loads a curated summary (~1,500 tokens) of all key facts about the repo. This replaces the full directory scan for 80% of tasks. The other 20% (Novel tasks that need specific agent details) load only the targeted agent files.

Measured impact: Task startup dropped from ~30K tokens (full scan) to ~3K tokens (hot index + targeted load). On a sprint with 30 agent invocations, this saves ~810K tokens.


Turn Budget: The 5-Turn Recovery Pattern

What happened: During Sprint 19, an agent was tasked with wiring check-coherence.ps1 into CI. It consumed 7 turns and produced no working configuration. The agent kept modifying the workflow YAML — changing indentation, restructuring steps — but the root problem was that the script exits 0 by default and CI never knew it had run with violations.

The 80/50 signal was ignored: At turn 4 (80% of a ≤5 budget), the agent had 1 of 3 required checks passing (33% progress). The correct move was to pause and reassess.

Resolution in turn 8: Changed approach — instead of fixing the workflow, added -Strict to the script call in run-tests.ps1. The same tests that had been passing now surfaced the violations. Three lines of change instead of 50.

Pattern confirmed: At 80% of turn budget with <50% progress, the approach is wrong. More turns with the same approach do not fix this.


Instruction Sizing: When Instructions Become Context Debt

Observation across Sprint 23–24: The governance instruction file (instructions/governance.instructions.md) grew to 8KB+ (estimated ~2,500 tokens). Because it has applyTo: "**/*", it loads on every agent invocation. Across a 30-invocation sprint, the governance instruction alone accounts for ~75K tokens.

What this means: Every 1KB added to a global instruction file costs ~750K tokens per year at BaseCoat's sprint velocity (~300 invocations/month). Global instruction files have a much higher amortized cost than targeted ones.

Rule reinforced: Global instructions (applyTo: "**/*") should contain only invariant rules. Domain-specific guidance belongs in scoped instruction files (applyTo: "agents/**", applyTo: "*.yml").


Compression: Haiku for Summarization, Sonnet for Analysis

Pattern from audit sprint: Four audit background agents each read 50–80 files and returned structured reports (150–200s wall time). All four used the Haiku-class explore model. All four produced accurate results. Total token cost: approximately the same as one Sonnet invocation with the same context.

The lesson: "Reading and organizing" is a Fast-tier task even when the output matters. "Analyzing and deciding based on what you read" is a Reasoning-tier task. The audit agents did the former; the session's main agent did the latter using the audit reports as compressed input (~15K tokens total from four Haiku agents, vs. ~80K tokens if the Sonnet session had read all the files directly).

Template for expensive research:

Fast agent (Haiku): read files, grep patterns, count things, organize findings
  → output: structured report (~2K tokens)

Reasoning agent (Sonnet): receive structured reports, analyze, decide, act
  → input: N × 2K tokens from fast agents
  → avoids: N × 30K tokens of direct file reading

Adapting for Other Providers

This guide focuses on GitHub Copilot (GHCP). If your team routes agents to a different provider, model names, tier pricing, and rate limits will differ:

Provider Model tier equivalent Rate limit difference Billing unit
Azure OpenAI GPT-4o ~= Standard, GPT-4o-mini ~= Fast TPM/RPM per deployment Per token
Anthropic API Opus ~= Premium, Sonnet ~= Standard, Haiku ~= Fast Per-minute limits Per token
AWS Bedrock On-demand vs. provisioned Regional quotas Per token
OpenAI API GPT-4o ~= Standard, GPT-4o-mini ~= Fast Tier-based RPM/TPM Per token

Adjust rate-limit constants and budget thresholds to match the provider's documented limits. For UBB cost estimation and monitoring guidance, see ubb-token-guidance.md.