Token Optimization & Context Handling¶
Strategies for managing token budgets, compressing context, and handing off state between agents in a multi-agent system. Companion to MODEL_OPTIMIZATION.md (model selection) and ../architecture/multi-agent-orchestration-patterns.md (branch coordination).
Tracking: Issue #42
GHCP-specific: This guidance was developed and tested against GitHub Copilot (GHCP). If you are using Azure OpenAI, Anthropic API, AWS Bedrock, or another provider, model names, tier pricing, and rate limits will differ. See Adapting for Other Providers.
1. Context Window Management Strategies¶
Every model has a finite context window. Treating it as unlimited leads to degraded output quality, truncated responses, and wasted spend.
Know Your Limits¶
| Model | Context Window | Effective Limit | Notes |
|---|---|---|---|
| claude-opus-4.6 | 200K tokens | ~160K usable | Reserve 20% for output generation |
| claude-sonnet-4.6 | 200K tokens | ~160K usable | Same reservation applies |
| gpt-5.3-codex | 200K tokens | ~160K usable | Code-optimized; large inputs degrade non-code reasoning |
| claude-haiku-4.5 | 200K tokens | ~160K usable | Fast but quality drops sharply past ~80K input |
| gpt-5.4-mini | 128K tokens | ~100K usable | Budget model; keep inputs under 60K for reliable output |
Rule of thumb: Reserve 20% of the context window for output. If you need 40K tokens of output on a 200K-token model, you have 200K - 40K = 160K for input.
Layered Context Loading¶
Load context in priority order rather than dumping everything at once:
- System instructions — governance rules, role definition (~1–2K tokens)
- Task-specific instructions — the immediate objective (~0.5–1K tokens)
- Critical reference files — only files the agent must read to complete the task
- Supporting context — examples, history, related docs (load only if budget remains)
Stop loading when you reach 60% of the effective limit. The remaining 40% covers agent reasoning and output.
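A minimal sketch of this loading loop in TypeScript. The `ContextLayer` shape and the `estimateTokens` heuristic (~4 characters per token) are illustrative assumptions, not a real API:

```typescript
// Layered context loading: add layers in priority order, stop at 60% of the
// effective limit so 40% remains for reasoning and output.
type ContextLayer = { name: string; content: string };

// Rough heuristic; swap in the provider's tokenizer for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function loadLayeredContext(layers: ContextLayer[], effectiveLimit: number): string {
  const budget = effectiveLimit * 0.6;
  const loaded: string[] = [];
  let used = 0;
  for (const layer of layers) {
    // layers arrive pre-sorted: system, task, critical files, supporting context
    const cost = estimateTokens(layer.content);
    if (used + cost > budget) break; // stop loading once the budget is reached
    loaded.push(layer.content);
    used += cost;
  }
  return loaded.join("\n\n");
}
```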
2. Token Budget Allocation per Agent Role¶
Align token budgets with the model tier from MODEL_OPTIMIZATION.md:
| Agent Role | Model Tier | Recommended Input Budget | Max Output Budget | Rationale |
|---|---|---|---|---|
| architect | Premium | 80K | 40K | Deep reasoning needs room for chain-of-thought |
| security_analyst | Premium | 80K | 40K | Must analyze full code paths without truncation |
| reviewer / code-review | Reasoning | 60K | 20K | Diffs + context; output is structured comments |
| researcher | Reasoning | 80K | 30K | May ingest large docs; output is synthesis |
| backend-dev / frontend-dev | Code | 60K | 40K | Code generation needs generous output budget |
| sprint-planner | Reasoning | 40K | 20K | Structured decomposition, not large inputs |
| merge-coordinator | Fast | 20K | 10K | Small diffs, merge commands, status checks |
| config-auditor / watchdog | Fast | 30K | 5K | Scan-and-report; output is a short checklist |
Budget Enforcement Pattern¶
// Heuristic: ~4 characters per token; swap in the provider's tokenizer for exact counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// compressContext (defined elsewhere) applies the techniques in §3: summarize, select, truncate.
function enforceTokenBudget(prompt, maxInputTokens) {
  const estimated = estimateTokens(prompt);
  if (estimated > maxInputTokens) {
    return compressContext(prompt, maxInputTokens);
  }
  return prompt;
}
3. Prompt Compression Techniques¶
When context exceeds the budget, compress — don't truncate blindly.
3.1 Summarization¶
Replace large blocks with concise summaries. Use a fast-tier model (Haiku) to generate summaries before passing them to a higher-tier model.
Before (2,400 tokens):
Full git diff of 15 files with 200 changed lines
After (400 tokens):
"Summary: 15 files changed across src/auth/ and src/api/.
Key changes: JWT validation refactored, rate-limit middleware added,
3 new API routes (/users, /sessions, /tokens). No deletions."
Cost: One Haiku summarization call (~0.02¢) can save 2,000 tokens on a Sonnet call (~0.6¢).
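A sketch of this two-stage pattern. `callModel` is a stand-in for whatever provider client you use; the prompt wording follows the example above, and the model IDs come from the table in §1:

```typescript
// Placeholder for your provider client; not a real library call.
declare function callModel(model: string, prompt: string): Promise<string>;

// Stage 1: fast-tier model condenses the bulky diff (~2,400 -> ~400 tokens).
async function summarizeDiff(diff: string): Promise<string> {
  return callModel(
    "claude-haiku-4.5",
    `Summarize this diff in under 150 words: files touched, key changes, deletions.\n\n${diff}`,
  );
}

// Stage 2: reasoning-tier model works from the compressed summary.
async function reviewCompressed(diff: string): Promise<string> {
  const summary = await summarizeDiff(diff);
  return callModel("claude-sonnet-4.6", `Review these changes:\n\n${summary}`);
}
```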
3.2 Selective Inclusion¶
Only include what the agent needs for its specific task:
| Agent Task | Include | Exclude |
|---|---|---|
| Code review | Changed files, diff hunks, test results | Unrelated source files, full history |
| Architecture | Interface definitions, dependency graph | Implementation bodies, test fixtures |
| Security scan | Auth code, input handling, config | UI components, styling, docs |
| Sprint planning | Issue list, velocity data, blockers | Source code, CI logs |
3.3 Truncation with Markers¶
When you must truncate, leave breadcrumbs so the agent knows context was removed:
[FILE: src/auth/jwt.ts — 340 lines, showing lines 1-50 and 280-340]
[TRUNCATED: lines 51-279 contain helper functions — request full file if needed]
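One way to generate these markers mechanically. The helper below is a sketch; the head/tail line counts are arbitrary defaults:

```typescript
// Keep the head and tail of a file, replacing the middle with breadcrumbs
// that tell the agent exactly what was removed and how to get it back.
function truncateWithMarkers(path: string, lines: string[], head = 50, tail = 60): string {
  if (lines.length <= head + tail) return lines.join("\n");
  const marker =
    `[FILE: ${path} — ${lines.length} lines, showing lines 1-${head} ` +
    `and ${lines.length - tail + 1}-${lines.length}]\n` +
    `[TRUNCATED: lines ${head + 1}-${lines.length - tail} — request full file if needed]`;
  return [marker, ...lines.slice(0, head), ...lines.slice(lines.length - tail)].join("\n");
}
```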
3.4 Reference by Pointer¶
Instead of inlining large files, reference them:
For the full API schema, see: docs/api-schema.yaml (420 lines, ~8K tokens)
Key endpoints relevant to this task: POST /auth/login, DELETE /auth/session
4. Context Handoff Between Agents¶
When one agent's output becomes another agent's input, transfer only what matters.
What to Pass¶
| Data | Pass? | Format |
|---|---|---|
| Task result / deliverable | ✅ Always | Full output |
| Decision rationale | ✅ Always | 2–3 sentence summary |
| Files created or modified | ✅ Always | File paths + brief description |
| Unresolved issues or blockers | ✅ Always | Structured list |
| Error messages encountered | ✅ If relevant | Exact error text |
| Full conversation history | ❌ Never | Summarize instead |
| Intermediate reasoning steps | ❌ Never | Only pass conclusions |
| Unchanged reference files | ❌ Never | Agent can load its own |
Handoff Template¶
## Agent Handoff: {source-role} → {target-role}
### Completed
- {what was done, 1-2 lines each}
### Artifacts
- `path/to/file.ts` — {what it contains}
- `path/to/test.ts` — {test coverage summary}
### Decisions Made
- {decision}: {rationale in one line}
### Open Items
- {anything the next agent must address}
### Context Files Needed
- {only files the next agent should load}
Target size: Handoff documents should be 500–1,500 tokens. If yours exceeds 2K tokens, compress further.
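If handoffs are built programmatically, a typed shape makes the size target enforceable. This interface is one possible encoding of the template above, not a prescribed format:

```typescript
// Structured handoff mirroring the template sections above.
interface AgentHandoff {
  source: string;                                      // e.g. "architect"
  target: string;                                      // e.g. "backend-dev"
  completed: string[];                                 // 1-2 lines each
  artifacts: { path: string; description: string }[];
  decisions: { decision: string; rationale: string }[];
  openItems: string[];
  contextFilesNeeded: string[];                        // only what the next agent should load
}

// Rough size gate: target 500-1,500 tokens, compress anything past 2K.
function withinHandoffBudget(handoff: AgentHandoff): boolean {
  const estimatedTokens = Math.ceil(JSON.stringify(handoff).length / 4);
  return estimatedTokens <= 2000;
}
```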
Pipeline Example¶
Architect (Opus, ~80K input)
→ produces: design doc + handoff (1.2K tokens)
Backend-Dev (Codex, ~60K input)
→ receives: handoff + design doc + relevant source files
→ produces: implementation + handoff (800 tokens)
Reviewer (Sonnet, ~60K input)
→ receives: handoff + diff + test results
→ produces: review comments + approval/rejection
5. Caching Strategies for Repeated Context¶
Avoid re-reading and re-tokenizing the same content across agent invocations.
System Prompt Caching¶
Most providers cache system prompts across calls with identical prefixes. Structure prompts so the stable prefix (governance rules, role definition) stays constant:
[CACHED — identical across all calls for this agent role]
System instructions (governance.instructions.md)
Role definition (agent .md file)
[VARIABLE — changes per invocation]
Task-specific context
File contents
Conversation history
Savings: Anthropic's prompt caching charges ~10% of normal input cost for cached tokens. A 3K-token system prompt called 50 times saves ~135K billable tokens.
Cross-Agent Shared Context¶
When multiple agents need the same reference (e.g., a project spec), load it once and pass a summary to subsequent agents rather than having each agent re-read the full document.
Stale Context Invalidation¶
Cache keys should include (see the sketch below):
- File content hash (not just path — files change)
- Branch name (context differs across branches)
- Timestamp with TTL (default: 15 minutes for active sprint work)
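A sketch of key construction along those lines, using Node's built-in crypto module. TTL bucketing is a simple way to get time-based expiry without a cache server:

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 15 * 60 * 1000; // 15-minute default for active sprint work

// Key changes whenever the file content, the branch, or the TTL bucket changes,
// so stale entries miss naturally instead of needing explicit invalidation.
function cacheKey(fileContent: string, branch: string, now = Date.now()): string {
  const contentHash = createHash("sha256").update(fileContent).digest("hex").slice(0, 16);
  const ttlBucket = Math.floor(now / TTL_MS);
  return `${branch}:${contentHash}:${ttlBucket}`;
}
```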
6. Measurement and Monitoring¶
Token Usage Tracking¶
Track per-agent, per-invocation:
| Metric | How to Capture | Why It Matters |
|---|---|---|
| Input tokens | Provider API response | Cost attribution |
| Output tokens | Provider API response | Cost attribution + quality signal |
| Cache hit tokens | Provider API response (if available) | Caching effectiveness |
| Context utilization | Input tokens ÷ effective window | Over 80% = risk of quality degradation |
| Compression ratio | Original tokens ÷ compressed tokens | Compression effectiveness |
Cost Estimation Formula¶
Per-invocation cost =
(input_tokens × input_price_per_1M / 1,000,000)
+ (output_tokens × output_price_per_1M / 1,000,000)
- (cached_tokens × cache_discount_per_1M / 1,000,000)
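The formula translates directly to code. The pricing fields are placeholders to fill in from your provider's rate card:

```typescript
interface Pricing {
  inputPer1M: number;         // $ per 1M input tokens
  outputPer1M: number;        // $ per 1M output tokens
  cacheDiscountPer1M: number; // $ saved per 1M cached tokens
}

function invocationCost(
  inputTokens: number,
  outputTokens: number,
  cachedTokens: number,
  price: Pricing,
): number {
  return (
    (inputTokens * price.inputPer1M) / 1_000_000 +
    (outputTokens * price.outputPer1M) / 1_000_000 -
    (cachedTokens * price.cacheDiscountPer1M) / 1_000_000
  );
}
```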
Sprint-Level Budget¶
Estimate total sprint token usage:
Example for a 10-issue sprint:
- 10 planning calls (Sonnet, ~40K input, ~10K output) ≈ 500K tokens
- 30 implementation calls (Codex, ~60K input, ~30K output) ≈ 2.7M tokens
- 20 review calls (Sonnet, ~40K input, ~10K output) ≈ 1M tokens
- 40 automation calls (Haiku, ~20K input, ~5K output) ≈ 1M tokens
- Total: ~5.2M tokens per sprint
Alerts¶
Set thresholds to catch runaway usage (sketched below):
- Single invocation exceeds 150K input tokens → warn
- Agent role exceeds 2× its average daily usage → investigate
- Sprint total exceeds budget by 20% → pause and review
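The three thresholds as a single check; the returned action names are illustrative:

```typescript
type UsageAlert = "warn" | "investigate" | "pause" | null;

// Checked in severity order: a sprint overrun outranks a single oversized call.
function checkUsage(
  inputTokens: number,
  roleDailyUsage: number,
  roleAvgDaily: number,
  sprintTotal: number,
  sprintBudget: number,
): UsageAlert {
  if (sprintTotal > sprintBudget * 1.2) return "pause";        // 20% over sprint budget
  if (roleDailyUsage > roleAvgDaily * 2) return "investigate"; // 2x average daily usage
  if (inputTokens > 150_000) return "warn";                    // single oversized invocation
  return null;
}
```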
7. Instruction File Sizing¶
Instruction files (.instructions.md, .agent.md) are loaded into every invocation. Oversized instructions waste budget on every call.
| File Type | Target Size | Max Size | Tokens (est.) |
|---|---|---|---|
| `.instructions.md` | 1–3 KB | 5 KB | ~500–1,500 |
| `.agent.md` | 2–4 KB | 6 KB | ~800–2,000 |
| Governance instructions | 3–5 KB | 8 KB | ~1,200–2,500 |
Sizing Guidelines¶
- One concern per file. Split multi-topic instructions into separate files.
- Link, don't inline. Reference detailed docs by path instead of copying content.
- Prune examples. One good example beats three redundant ones.
- Audit quarterly. Remove outdated rules that no longer apply.
8. Practical Examples with Token Counts¶
Example A: Code Review — Well-Optimized¶
System prompt (governance + reviewer role): 1,800 tokens
Task instructions: 400 tokens
Diff (3 files, 120 lines changed): 2,200 tokens
Test results summary: 300 tokens
───────────────────────────────────────────────
Total input: 4,700 tokens
Output (review comments): 1,200 tokens
Model: claude-sonnet-4.6
Estimated cost: ~$0.02
Example B: Code Review — Unoptimized¶
System prompt (full governance + all instructions): 4,500 tokens
Full conversation history (15 turns): 12,000 tokens
All source files in the repo: 45,000 tokens
Full CI log output: 8,000 tokens
───────────────────────────────────────────────
Total input: 69,500 tokens
Output (same review comments): 1,200 tokens
Model: claude-sonnet-4.6
Estimated cost: ~$0.23
Savings: 91% cost reduction by loading only what the agent needs.
Example C: Architect → Backend-Dev Handoff¶
Architect output (full): 8,000 tokens
Handoff document (compressed): 1,200 tokens
Backend-dev loads: 3 relevant source files: 6,400 tokens
───────────────────────────────────────────────
Backend-dev total input: 7,600 tokens
Without handoff (re-reads everything): 35,000 tokens
Savings: 78%
9. Task Complexity, Turn Budgets, and Adaptive Learning¶
Agent work has a cost that goes beyond tokens — it also consumes turns. Turns compound: each one adds history to the context window, narrows decision space, and raises the cost of a wrong approach. This section defines how to classify tasks, budget turns, and convert experience into reusable memory.
9.1 Complexity Classification¶
Classify each task before starting. State the class in your plan or at the top of your first response.
| Class | Definition | Typical indicators | Soft turn budget |
|---|---|---|---|
| Routine | Pattern fully covered by existing instructions or memory | "We have an instruction for this", existing template matches | ≤ 3 turns |
| Familiar | Similar to prior work but with new variables or constraints | "We've done something like this", partial memory match | ≤ 5 turns |
| Novel | No prior coverage; encountering this pattern for the first time | No matching instruction, no memory hit, new tool/API/pattern | Estimate N turns upfront |
For Novel tasks, state the estimated turn cost at task start:
Task: Integrate Azure Service Connector with App Configuration
Class: Novel — no prior Service Connector instruction exists
Estimated turns: 6 (learning cost: ~3 for discovery, ~3 for implementation + validation)
Learning cost amortization: A Novel task that succeeds becomes Familiar the next time. Familiar tasks that are done repeatedly become Routine. Memory is the mechanism — see §9.3.
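The classes and soft budgets reduce to a lookup. Treating the Novel default as 6 turns mirrors the worked example above, but that number is an assumption; Novel budgets should be estimated per task:

```typescript
type TaskClass = "routine" | "familiar" | "novel";

// Soft turn budgets from the classification table.
function turnBudget(cls: TaskClass, novelEstimate?: number): number {
  switch (cls) {
    case "routine":  return 3;
    case "familiar": return 5;
    case "novel":    return novelEstimate ?? 6; // state the estimate at task start
  }
}
```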
9.2 Failure Protocol — Stuck After 5 Turns¶
If a task has consumed more than 5 turns and there has been no measurable forward progress, stop. Do not continue with more of the same approach.
What counts as forward progress?¶
| Signal | Counts? |
|---|---|
| A new test passes that did not pass before | ✅ Yes |
| A new error class is resolved (not just a different message for the same root cause) | ✅ Yes |
| A file reaches its intended target state | ✅ Yes |
| A blocker is identified and removed | ✅ Yes |
| The same error appears with different output | ❌ No |
| More lines of code added without test validation | ❌ No |
| Escalating to a higher-tier model without changing the approach | ❌ No |
When stuck:¶
1. Log to memory. Call `store_memory` with:
   - Task description (what was being attempted)
   - Approach tried (summarize the strategy, not the code)
   - Failure mode (exact error, logical dead-end, or missing capability)
   - Blocking signal (what specifically is preventing progress)
2. Reassess, don't escalate. Change the approach before changing the model tier. Common pivots:
   - Break the task into smaller independently-verifiable units
   - Load docs or examples that were previously skipped to save tokens
   - Invert the approach (bottom-up instead of top-down, or vice versa)
3. Consult memory for similar prior failures.
4. Escalate model tier only as a last resort, and only when the problem genuinely requires deeper reasoning — not just more attempts.
Early warning: the 80/50 rule¶
At 80% of your turn budget, if you have less than 50% progress toward the goal, pause and reassess. Don't wait until fully stuck.
Turn budget: 5
Current turn: 4 (80%)
Progress: 1 test passing of 4 required (25%)
→ Pause. Reassess before turn 5.
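The rule is mechanical enough to encode; a small checker that reproduces the worked example:

```typescript
// At >=80% of the turn budget with <50% progress, pause and reassess.
function shouldReassess(
  turnsUsed: number,
  turnBudget: number,
  progressDone: number,
  progressTotal: number,
): boolean {
  return turnsUsed / turnBudget >= 0.8 && progressDone / progressTotal < 0.5;
}
```

`shouldReassess(4, 5, 1, 4)` returns true: turn 4 of 5 with 1 of 4 tests passing, matching the example above.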
9.3 Success Protocol — Learning Reinforcement¶
When a task completes within its turn budget and test validation passes, evaluate whether the solution involved a non-obvious pattern.
Reinforce when:
- The solution required a discovery that no existing instruction covers
- The approach was non-obvious and could save future agents 2+ turns
- A specific error was encountered and resolved in a reusable way

Skip when:
- The task followed an existing instruction exactly
- The solution is boilerplate (adding a route, writing a standard test, etc.)
- The memory fact would duplicate existing instruction content
What to store:
Subject: <topic area>
Fact: <pattern or rule, ≤200 chars>
Citations: <file:line or "User input: ..." or "Validated in task X">
Reason: <why this will help future tasks; what turns it saves>
Good example:
Subject: gh-aw compilation
Fact: gh aw safe-outputs: add-labels and add-comment take no sub-properties.
allowed-labels belongs under create-issue, not add-labels.
Citations: .github/workflows/issue-triage.md — Sprint 11 compile iteration
Reason: This error burned 2 turns on Sprint 11. Any future agentic workflow
author will hit it immediately without this memory.
Bad example (too generic — skip):
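Subject: code quality
Fact: Agents should write clean code and validate changes with tests.
Citations: none
Reason: Restates general practice already covered by instructions; saves no future turns.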
9.4 Turn Accounting Summary¶
Before task:
    Classify: Routine / Familiar / Novel
    If Novel: state estimated turn budget
During task:
    Track actual turns vs. budget
    At 80% budget / <50% progress: reassess
On failure (>5 turns, no progress):
    store_memory(failure pattern)
    change approach or escalate
On success (within budget + tests pass):
    if novel pattern: store_memory(solution pattern)
    classification drops one level (Novel→Familiar, Familiar→Routine)
10. From the Field: BaseCoat Sprint Experience¶
These examples come from BaseCoat's own development sprints. They illustrate how the strategies in §1–9 play out in practice.
Context Loading: The Cost of Speculative Reads¶
What happened: In early sprints, the explore skill pattern was: read the entire
agents/ directory to understand the repo state. At 77 agents (~400 tokens each),
this consumed ~30K tokens before any task work started.
What changed: The hot-index pattern. The L2 memory index (memory-index.instructions.md)
loads a curated summary (~1,500 tokens) of all key facts about the repo. This
replaces the full directory scan for 80% of tasks. The other 20% (Novel tasks
that need specific agent details) load only the targeted agent files.
Measured impact: Task startup dropped from ~30K tokens (full scan) to ~3K tokens (hot index + targeted load). On a sprint with 30 agent invocations, this saves ~810K tokens.
Turn Budget: The 5-Turn Recovery Pattern¶
What happened: During Sprint 19, an agent was tasked with wiring
check-coherence.ps1 into CI. It consumed 7 turns and produced no working
configuration. The agent kept modifying the workflow YAML — changing indentation,
restructuring steps — but the root problem was that the script exits 0 by default
and CI never knew it had run with violations.
The 80/50 signal was ignored: At turn 4 (80% of a ≤5 budget), the agent had 1 of 3 required checks passing (33% progress). The correct move was to pause and reassess.
Resolution in turn 8: Changed approach — instead of fixing the workflow, added
-Strict to the script call in run-tests.ps1. The same tests that had been
passing now surfaced the violations. Three lines of change instead of 50.
Pattern confirmed: At 80% of turn budget with <50% progress, the approach is wrong. More turns with the same approach do not fix this.
Instruction Sizing: When Instructions Become Context Debt¶
Observation across Sprint 23–24: The governance instruction file
(instructions/governance.instructions.md) grew to 8KB+ (estimated ~2,500 tokens).
Because it has applyTo: "**/*", it loads on every agent invocation. Across a
30-invocation sprint, the governance instruction alone accounts for ~75K tokens.
What this means: Every 1KB added to a global instruction file costs ~750K tokens per year at BaseCoat's sprint velocity (~300 invocations/month). Global instruction files have a much higher amortized cost than targeted ones.
Rule reinforced: Global instructions (applyTo: "**/*") should contain only
invariant rules. Domain-specific guidance belongs in scoped instruction files
(applyTo: "agents/**", applyTo: "*.yml").
Compression: Haiku for Summarization, Sonnet for Analysis¶
Pattern from audit sprint: Four audit background agents each read 50–80 files and returned structured reports (150–200s wall time). All four used the Haiku-class explore model. All four produced accurate results. Total token cost: approximately the same as one Sonnet invocation with the same context.
The lesson: "Reading and organizing" is a Fast-tier task even when the output matters. "Analyzing and deciding based on what you read" is a Reasoning-tier task. The audit agents did the former; the session's main agent did the latter using the audit reports as compressed input (~15K tokens total from four Haiku agents, vs. ~80K tokens if the Sonnet session had read all the files directly).
Template for expensive research:
Fast agent (Haiku): read files, grep patterns, count things, organize findings
→ output: structured report (~2K tokens)
Reasoning agent (Sonnet): receive structured reports, analyze, decide, act
→ input: N × 2K tokens from fast agents
→ avoids: N × 30K tokens of direct file reading
Adapting for Other Providers¶
This guide focuses on GitHub Copilot (GHCP). If your team routes agents to a different provider, model names, tier pricing, and rate limits will differ:
| Provider | Model tier equivalent | Rate limit difference | Billing unit |
|---|---|---|---|
| Azure OpenAI | GPT-4o ~= Standard, GPT-4o-mini ~= Fast | TPM/RPM per deployment | Per token |
| Anthropic API | Opus ~= Premium, Sonnet ~= Standard, Haiku ~= Fast | Per-minute limits | Per token |
| AWS Bedrock | On-demand vs. provisioned | Regional quotas | Per token |
| OpenAI API | GPT-4o ~= Standard, GPT-4o-mini ~= Fast | Tier-based RPM/TPM | Per token |
Adjust rate-limit constants and budget thresholds to match the provider's documented limits. For UBB cost estimation and monitoring guidance, see ubb-token-guidance.md.
Related References¶
- `MODEL_OPTIMIZATION.md` — Model tier matrix and cost considerations
- `../architecture/multi-agent-orchestration-patterns.md` — Branch coordination for parallel agents
- `instructions/governance.instructions.md` — Section 10: Token and Model Awareness
- Issue #42 — Tracking issue for token optimization
- Issue #44 — Token budget and cost attribution
- `ubb-token-guidance.md` — UBB billing model, cost estimation, monitoring