
Agent Model Optimization Guide

Choosing the right LLM for each agent role is a cost-performance tradeoff. Default-to-cheapest wastes time on retries and low-quality output; default-to-premium wastes money on tasks that don't need it. This guide provides a principled tier matrix so every agent runs on the model that matches its cognitive demand.

GHCP-specific: This guidance was developed and tested against GitHub Copilot (GHCP). If you are using Azure OpenAI, the Anthropic API, AWS Bedrock, or another provider, model names, tier pricing, and rate limits will differ. See Adapting for Other Providers.

Discovery context: During Sprint 2 of app-migration-with-ai, all agents defaulted to Haiku. Architect and security tasks underperformed significantly and required manual rework. This guide codifies the lessons learned.


Tier Matrix

| Role | Recommended Model | Tier | Rationale |
|------|-------------------|------|-----------|
| architect | claude-opus-4.7 | Premium | High-stakes design decisions requiring deep, multi-step reasoning |
| security_analyst | claude-opus-4.7 | Premium | Security analysis requires thorough reasoning and cannot afford shortcuts |
| reviewer / code-review | claude-sonnet-4.6 | Reasoning | Nuanced code analysis but not premium-tier complexity |
| researcher | claude-sonnet-4.6 | Reasoning | Analysis and synthesis require good reasoning depth |
| qa / manual-test-strategy / exploratory-charter / strategy-to-automation | claude-sonnet-4.6 | Reasoning | Structured thinking, edge case identification, test design |
| backend-dev / frontend-dev / middleware-dev / data-tier / code | gpt-5.3-codex | Code | Code-optimized model tuned for generation, refactoring, and debugging |
| sprint-planner / release-manager / project-onboarding | claude-sonnet-4.6 | Reasoning | Planning and decomposition need good reasoning, not raw code output |
| new-customization | claude-sonnet-4.6 | Reasoning | Deciding between customization types requires structured reasoning |
| merge-coordinator / rollout-basecoat / config-auditor | claude-haiku-4.5 | Fast | Routine automation with well-defined steps; speed and cost matter most |
| agent-watchdog / sprint-demo | gpt-5.4-mini | Fast | Simple automation tasks with minimal reasoning requirements |
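
In orchestration code, the matrix can be mirrored as a simple lookup so model choice stays in one place. A minimal sketch; the `MODEL_BY_ROLE` constant and `modelForRole` helper are illustrative names, not part of any existing module:

```typescript
// Illustrative lookup mirroring the tier matrix above (partial; names are hypothetical).
const MODEL_BY_ROLE: Record<string, string> = {
  'architect': 'claude-opus-4.7',
  'security_analyst': 'claude-opus-4.7',
  'reviewer': 'claude-sonnet-4.6',
  'researcher': 'claude-sonnet-4.6',
  'backend-dev': 'gpt-5.3-codex',
  'frontend-dev': 'gpt-5.3-codex',
  'sprint-planner': 'claude-sonnet-4.6',
  'merge-coordinator': 'claude-haiku-4.5',
  'agent-watchdog': 'gpt-5.4-mini',
};

// Roles not listed explicitly fall back to the Fast tier.
const DEFAULT_MODEL = 'claude-haiku-4.5';

function modelForRole(role: string): string {
  return MODEL_BY_ROLE[role] ?? DEFAULT_MODEL;
}
```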

Tier Definitions

Premium — claude-opus-4.7

Use for tasks where a mistake is expensive or irreversible: architecture decisions, security reviews, threat modeling, compliance analysis. These tasks require deep multi-step reasoning, weighing tradeoffs, and producing output that will be trusted without a second opinion.

Cost: ~5× Sonnet. Use deliberately. Variants: claude-opus-4.7-high (extended chain-of-thought), claude-opus-4.6 (proven fallback).

Reasoning — claude-sonnet-4.6

The workhorse tier. Use for tasks that require genuine analysis — code review, test strategy, planning, research — but where the output will be reviewed by a human or validated by CI before it matters. Good balance of quality and cost.

Cost: Baseline reference tier.

Code — gpt-5.3-codex

Use for code generation, refactoring, migration, and debugging. This model is specifically optimized for code tasks and outperforms general-purpose models on implementation work. Not ideal for prose-heavy analysis or strategic reasoning.

Cost: Comparable to Sonnet. Value comes from code-specific optimization.

Fast — claude-haiku-4.5 / gpt-5.4-mini

Use for well-defined, repetitive tasks: file scanning, config auditing, branch operations, rollout scripts, simple automation. The task should have clear inputs, deterministic steps, and easily validated output. If the agent needs to "think," it probably needs a higher tier.

Cost: ~10× cheaper than Sonnet. Use aggressively for routine work.


When to Override the Default

Override the recommended model when:

| Situation | Override Direction | Example |
|-----------|--------------------|---------|
| Task is unusually complex for the role | ↑ Upgrade one tier | A backend-dev task involving a complex distributed transaction → claude-sonnet-4.6 |
| Task is unusually simple for the role | ↓ Downgrade one tier | A code-review of a single-line typo fix → claude-haiku-4.5 |
| Output will not be human-reviewed | ↑ Upgrade one tier | Automated security scan running unattended → claude-opus-4.7 |
| Output will be heavily reviewed | ↓ Downgrade one tier | Draft PR description that a human will rewrite anyway → claude-haiku-4.5 |
| Budget is constrained | ↓ Use minimum viable tier | See the Minimum field in each agent's ## Model section |
| Task requires cross-domain reasoning | ↑ Upgrade one tier | A backend-dev task that also requires security analysis → claude-sonnet-4.6 |
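
One way to keep overrides mechanical is to express the reasoning tiers as an ordered list and move exactly one step up or down. A hedged sketch; the tier ordering and the `adjustTier` helper are assumptions for illustration, not an existing API:

```typescript
// Tiers ordered cheapest to most capable; one override = one step.
const TIER_ORDER = ['claude-haiku-4.5', 'claude-sonnet-4.6', 'claude-opus-4.7'] as const;

// direction: +1 to upgrade one tier, -1 to downgrade one tier.
function adjustTier(model: string, direction: 1 | -1): string {
  const i = TIER_ORDER.indexOf(model as (typeof TIER_ORDER)[number]);
  if (i === -1) return model;                              // code-tier models are left unchanged
  const j = Math.min(TIER_ORDER.length - 1, Math.max(0, i + direction));
  return TIER_ORDER[j];
}

// Example: a code-review of a one-line typo fix gets downgraded one tier.
adjustTier('claude-sonnet-4.6', -1);                       // 'claude-haiku-4.5'
```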

Cost Considerations

Rough relative cost per million tokens (input + output blended):

| Model | Relative Cost | Best For |
|-------|---------------|----------|
| claude-opus-4.7 | 5.0× | Architecture, security, high-stakes reasoning |
| claude-sonnet-4.6 | 1.0× (baseline) | Analysis, review, planning, test strategy |
| gpt-5.3-codex | ~1.0× | Code generation, refactoring, debugging |
| claude-haiku-4.5 | 0.1× | Routine automation, scanning, simple tasks |
| gpt-5.4-mini | 0.08× | Simple automation, monitoring, formatting |
| gpt-4.1 | 0.05× | Binary checks, guardrails, cheapest |

Rule of thumb: ten Haiku runs cost roughly the same as one Sonnet run, so if Haiku output is good enough, retries are effectively free and Haiku wins. If a single Opus run saves you from a production incident, use Opus.
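
To make the comparison concrete, here is a rough estimator built on the relative multipliers above. The multipliers and token counts are illustrative; plug in your actual UBB rates:

```typescript
// Relative cost multipliers from the table above (Sonnet = 1.0 baseline).
const RELATIVE_COST: Record<string, number> = {
  'claude-opus-4.7': 5.0,
  'claude-sonnet-4.6': 1.0,
  'gpt-5.3-codex': 1.0,
  'claude-haiku-4.5': 0.1,
  'gpt-5.4-mini': 0.08,
};

// Cost of a run in "Sonnet-equivalent" units: tokens (in millions) × multiplier.
function relativeRunCost(model: string, millionTokens: number): number {
  return (RELATIVE_COST[model] ?? 1.0) * millionTokens;
}

relativeRunCost('claude-haiku-4.5', 2) * 10;  // ten 2M-token Haiku runs ≈ 2 Sonnet-units
relativeRunCost('claude-opus-4.7', 2);        // one 2M-token Opus run   ≈ 10 Sonnet-units
```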


Consumer Configuration

In agent .md files

Each agent file includes a ## Model section:

## Model
**Recommended:** claude-sonnet-4.6
**Rationale:** Analysis tasks need good reasoning depth
**Minimum:** claude-haiku-4.5
  • Recommended — use this model by default
  • Rationale — why this tier was chosen (helps future reviewers)
  • Minimum — the cheapest model that can still complete the task acceptably; use when budget is constrained or the specific invocation is simple
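
If orchestration code needs these fields programmatically, a small parser is enough. A sketch under the assumption that every agent file follows the exact `**Recommended:** / **Rationale:** / **Minimum:**` layout shown above; the function name is hypothetical:

```typescript
import { readFileSync } from 'node:fs';

interface ModelSection {
  recommended?: string;
  rationale?: string;
  minimum?: string;
}

// Pull the Recommended/Rationale/Minimum fields out of an agent .md file.
function readModelSection(agentFile: string): ModelSection {
  const text = readFileSync(agentFile, 'utf8');
  const grab = (field: string) =>
    text.match(new RegExp(`\\*\\*${field}:\\*\\*\\s*(.+)`))?.[1].trim();
  return {
    recommended: grab('Recommended'),
    rationale: grab('Rationale'),
    minimum: grab('Minimum'),
  };
}

// Example: readModelSection('agents/backend-dev.agent.md').recommended → 'gpt-5.3-codex'
```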

In orchestration code

When spawning agents programmatically, pass the model as a parameter:

// Use the recommended model from the agent's ## Model section
const result = await spawnAgent({
  role: 'backend-dev',
  model: 'gpt-5.3-codex',   // from agents/backend-dev.agent.md
  prompt: taskDescription
});

In CI/CD pipelines

Set model selection via environment variables:

env:
  AGENT_MODEL_BACKEND: gpt-5.3-codex
  AGENT_MODEL_REVIEW: claude-sonnet-4.6
  AGENT_MODEL_SECURITY: claude-opus-4.6
  AGENT_MODEL_DEFAULT: claude-haiku-4.5
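
A sketch of how an orchestration script might resolve these variables, assuming the `AGENT_MODEL_<ROLE>` naming convention shown above; the `resolveModelFromEnv` helper is illustrative:

```typescript
// Resolve the model for a role from AGENT_MODEL_<ROLE>, falling back to AGENT_MODEL_DEFAULT.
function resolveModelFromEnv(role: string): string {
  const key = `AGENT_MODEL_${role.toUpperCase().replace(/-/g, '_')}`;
  return process.env[key]
    ?? process.env.AGENT_MODEL_DEFAULT
    ?? 'claude-haiku-4.5';   // last-resort default matches the Fast tier
}

// Example: with the env block above, resolveModelFromEnv('backend') → 'gpt-5.3-codex'
```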

Agent Template Pattern

When creating a new agent, include the ## Model section in the .agent.md file. Place it immediately before the ## Governance section:

## Model
**Recommended:** <model-name>
**Rationale:** <one line explaining why this tier>
**Minimum:** <cheapest acceptable model>

Choose the tier based on the cognitive demand of the agent's primary task:

| Cognitive Demand | Tier | Model |
|------------------|------|-------|
| Deep multi-step reasoning, high-stakes decisions | Premium | claude-opus-4.7 |
| Analysis, structured thinking, planning | Reasoning | claude-sonnet-4.6 |
| Code generation, refactoring, implementation | Code | gpt-5.3-codex |
| Routine steps, scanning, simple automation | Fast | claude-haiku-4.5 or gpt-5.4-mini |
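
The template requirement is easy to enforce in CI. A hedged sketch of a lint that fails when an .agent.md file is missing the ## Model section or places it after ## Governance; the directory layout and function name are assumptions:

```typescript
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

// Returns the agent files that violate the template pattern.
function findModelSectionViolations(agentsDir: string): string[] {
  return readdirSync(agentsDir)
    .filter((f) => f.endsWith('.agent.md'))
    .filter((f) => {
      const text = readFileSync(join(agentsDir, f), 'utf8');
      const model = text.indexOf('## Model');
      const governance = text.indexOf('## Governance');
      // Violation: no ## Model section, or it appears after ## Governance.
      return model === -1 || (governance !== -1 && model > governance);
    });
}

// Example CI usage: fail the build if any agent file is out of shape.
// if (findModelSectionViolations('agents').length > 0) process.exit(1);
```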

Lessons from Production

These examples come from BaseCoat's own sprint history. Each led to a concrete change in the tier matrix or override rules above.

Sprint 2: The Haiku Default Failure

What happened: All agents in Sprint 2 defaulted to claude-haiku-4.5. The architect agent was tasked with designing a multi-region failover strategy. The output was structurally correct but shallow — it described patterns without evaluating tradeoffs, skipped durability edge cases, and required two rounds of manual rework before the design was usable.

Root cause: Haiku lacks the chain-of-thought depth for multi-step reasoning under uncertainty. Architect tasks are not "follow a template" tasks — they require weighing contradictory constraints.

What changed: The tier matrix now assigns architect to Premium unconditionally, with an explicit override note: "never downgrade below Reasoning even under budget pressure."

Signal to watch for: If an agent produces output that is structurally correct but feels thin — correct vocabulary, missing substance — it is likely under-tiered.


Sprint 19: Haiku for Scanning, Not for Judgment

What happened: A config-auditor agent using claude-sonnet-4.6 was processing 100+ config files per sprint. The cost was material. Switching to claude-haiku-4.5 produced identical results for the scan phase (does this value match this pattern?) but failed on the judgment phase (is this config safe given the deployment context?).

Resolution: Split the agent into two steps — Haiku for the scan loop, Sonnet for the final risk assessment. Cost dropped ~75% with no quality regression on the judgment output.

Pattern: Scanning is a Fast-tier task. Judgment on scan results is a Reasoning-tier task. If your agent does both in one pass, split it.
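
The split looks roughly like this in orchestration code. A sketch reusing the hypothetical `spawnAgent` call from the Consumer Configuration section and assuming `configFiles` holds the files to audit; the two-pass structure is the point, not the exact API:

```typescript
// Pass 1 (Fast tier): scan each config file with Haiku; cheap, repetitive, well-defined.
const scanResults: unknown[] = [];
for (const file of configFiles) {
  scanResults.push(
    await spawnAgent({
      role: 'config-auditor',
      model: 'claude-haiku-4.5',
      prompt: `Scan ${file} for values that do not match the expected patterns.`,
    })
  );
}

// Pass 2 (Reasoning tier): one Sonnet call judges the aggregated findings in deployment context.
const riskAssessment = await spawnAgent({
  role: 'config-auditor',
  model: 'claude-sonnet-4.6',
  prompt: 'Given these scan findings, assess which configs are unsafe for this deployment:\n'
    + JSON.stringify(scanResults, null, 2),
});
```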


Sprint 24: Fleet Rate Limit Discovery

What happened: A sprint launch dispatched 5 concurrent background agents. After ~90 seconds, three returned 429 errors and stopped mid-task. Retrying immediately produced the same result. The sprint had to be restarted with a manual wave pattern.

What we learned:
  • Enterprise Copilot API limits concurrent sessions, not just requests per minute
  • 3 concurrent agents = safe ceiling
  • 4 = risky, depending on org activity at the time
  • 5+ = near-certain 429 within 60–90 seconds
  • Recovery: wait 90 seconds, then restart with reduced concurrency

What changed: Added a 15-second inter-agent stagger to all sprint kickoff scripts. Wave size capped at 3 in sprint-kickoff-safe.ps1. The rate-limit guidance doc was updated to cover AI/Copilot fleet limits (previously only covered GitHub REST API).

Model connection: Running 5 agents on Premium models worsens this. Premium models use more server-side resources. If you must run 5+ agents, use Fast/Code tier where possible to reduce per-agent resource consumption.
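
The stagger-and-cap pattern is straightforward to reproduce. A sketch in TypeScript (the actual BaseCoat script is sprint-kickoff-safe.ps1, in PowerShell); the wave size and stagger delay follow the numbers above, and `spawnAgent` is the hypothetical call used elsewhere in this guide:

```typescript
const WAVE_SIZE = 3;            // safe concurrency ceiling observed in Sprint 24
const STAGGER_MS = 15_000;      // 15-second stagger between agent launches

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Dispatch agents in waves of 3, staggering launches inside each wave.
async function dispatchInWaves(tasks: Array<{ role: string; model: string; prompt: string }>) {
  for (let i = 0; i < tasks.length; i += WAVE_SIZE) {
    const wave = tasks.slice(i, i + WAVE_SIZE);
    const running: Promise<unknown>[] = [];
    for (const task of wave) {
      running.push(spawnAgent(task));   // hypothetical call from the earlier example
      await sleep(STAGGER_MS);
    }
    await Promise.all(running);         // wait for the wave to finish before starting the next
  }
}
```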


Ongoing: Explore Agents Are Correctly Tiered at Fast/Haiku

Observation: The BaseCoat audit agents (audit-asset-counts, audit-ci-governance, audit-philosophy-adr, audit-scripts-memory) each ran for 150–200 seconds using the Haiku-class explore agent. All four produced accurate, well-structured reports.

Why this works: Explore agents read files, grep patterns, and report findings. This is scanning and synthesis from structured data — a Fast-tier task. They do not need multi-step reasoning or judgment. Sonnet would have cost ~10× more per agent for no improvement in output quality.

Rule confirmed: Match tier to the cognitive demand of the task, not the perceived importance of the output. An audit report can be important without requiring Premium reasoning to produce.


Adapting for Other Providers

The tier matrix above uses GitHub Copilot (GHCP) model names. If your agents target a different provider, map the tiers as follows:

| Provider | Tier Equivalents | Rate Limit Difference | Billing Unit |
|----------|------------------|-----------------------|--------------|
| Azure OpenAI | GPT-4o ≈ Reasoning, GPT-4o-mini ≈ Fast | TPM/RPM per deployment | Per token |
| Anthropic API | Opus ≈ Premium, Sonnet ≈ Reasoning, Haiku ≈ Fast | Per-minute limits | Per token |
| AWS Bedrock | On-demand vs. provisioned throughput | Regional quotas | Per token |
| OpenAI API | GPT-4o ≈ Reasoning, GPT-4o-mini ≈ Fast | Tier-based RPM/TPM | Per token |

Substitute provider-specific model names in agent ## Model sections and CI/CD environment variables. Cost considerations and override logic remain the same. For UBB billing and monitoring guidance, see ubb-token-guidance.md.
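
When a pipeline targets more than one provider, the tier-to-model mapping can live in one place. A sketch; the structure and the GHCP/Azure entries are illustrative placeholders drawn from the table above, not an existing config format, and you should substitute your own provider model identifiers and deployments:

```typescript
type Tier = 'premium' | 'reasoning' | 'code' | 'fast';

// Illustrative per-provider mapping; fill in the providers and deployments you actually use.
const PROVIDER_MODELS: Record<string, Partial<Record<Tier, string>>> = {
  ghcp: {
    premium: 'claude-opus-4.7',
    reasoning: 'claude-sonnet-4.6',
    code: 'gpt-5.3-codex',
    fast: 'claude-haiku-4.5',
  },
  'azure-openai': {
    reasoning: 'gpt-4o',
    fast: 'gpt-4o-mini',
  },
  // Add anthropic, bedrock, or openai entries with that provider's model identifiers.
};

function modelFor(provider: string, tier: Tier): string | undefined {
  return PROVIDER_MODELS[provider]?.[tier];
}
```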


Related Documentation

  • instructions/governance.instructions.md — Section 10 covers model awareness policy
  • instructions/token-economics.instructions.md — Turn budget classification and fast-path routing
  • Individual agent files in agents/ — each contains a ## Model section
  • token-optimization.md — Context window management and budget allocation
  • ubb-token-guidance.md — UBB billing model, cost estimation, alert thresholds