LLM Models
Configure models with pricing information for accurate cost tracking across all billing dimensions.
Overview
LLM Models represent the specific AI models available from your configured providers. Control Bridge automatically discovers available models, but you can configure pricing information to enable accurate cost tracking.
Modern LLM providers charge for more than just input and output tokens. Cache reads, cache writes, and long context requests each have their own pricing dimensions. Configuring all of these ensures the Usage Dashboard reflects your actual provider billing.
Understanding Models
Model Hierarchy
Provider (e.g., Anthropic)
└── Model (e.g., claude-sonnet-4-6)
├── Input pricing (per million tokens)
├── Output pricing (per million tokens)
├── Cache read discount (fraction of input price)
├── Cache write premium (multiplier on input price)
└── Pricing tiers (long context surcharge configuration)
Model Capabilities
Different models offer different tradeoffs:
| Model Tier | Speed | Intelligence | Cost | Best For |
|---|---|---|---|---|
| Small (Haiku, GPT-4o-mini) | Fast | Good | Low | Simple tasks, high volume |
| Medium (Sonnet, GPT-4o) | Moderate | Excellent | Medium | General use, balanced |
| Large (Opus, GPT-5.2) | Slower | Best | High | Complex reasoning |
Context Windows and Output Limits
Context window is the maximum number of tokens a model can process in a single request (input + output combined). Max output is the ceiling for generated response tokens.
| Provider | Model | Context Window | Max Output |
|---|---|---|---|
| Anthropic | Claude Opus 4.6 | 1,000,000 | 128,000 |
| Anthropic | Claude Sonnet 4.6 | 1,000,000 | 128,000 |
| Anthropic | Claude Opus/Sonnet 4.5 | 200,000 | 64,000 |
| Anthropic | Claude Haiku 4.5 | 200,000 | 64,000 |
| OpenAI | GPT-5.2 | 400,000 | 128,000 |
| OpenAI | GPT-5 Mini | 400,000 | 128,000 |
| OpenAI | GPT-4o / GPT-4o Mini | 128,000 | 16,384 |
| xAI | Grok 4 | 256,000 | 100,000 |
| xAI | Grok 4.1 Fast | 2,000,000 | 100,000 |
| Gemini | Gemini 2.5 Pro / Flash | 1,000,000 | 65,536 |
Context window sources (verified 2026-03-15):
Viewing Models
Navigate to Models
Go to Build > Governance > AI Providers and you will see models listed under each provider.
Model Information
For each model, you can see:
- Model ID - The technical identifier used by the provider
- Display Name - Human-readable name
- Input Price - Cost per million input tokens
- Output Price - Cost per million output tokens
- Cached Input Discount - Multiplier applied to cache read tokens
- Cache Write Premium - Multiplier applied to cache write tokens
- Pricing Tiers - Long context surcharge configuration (JSON)
- Status - Whether the model is available
Adding Models to a Provider
Using Model Templates
When adding a new model to a provider, you can use pre-configured templates that automatically populate model specifications:
- Go to Build > Governance > AI Providers
- Click on a provider to expand it
- Click Add Model
- Select a model from the Use a template dropdown
- The form auto-fills with:
- Model identifier
- Display name
- Context window size
- Input/output pricing
- Cache pricing values
- Default temperature
- Tool calling and streaming support
- Customize any values if needed
- Click Save
Templates are sourced from the global model database and include up-to-date pricing and capabilities for common models, including cache pricing defaults.
Manual Model Configuration
If no template is available for your model:
- Click Add Model on a provider
- Enter the model information manually:
- Model Name - The API identifier (e.g.,
claude-sonnet-4-6) - Display Name - Human-readable name
- Context Window - Maximum tokens the model can process
- Pricing - Input and output cost per million tokens
- Cache Pricing - Cache read discount and write premium
- Pricing Tiers - Long context surcharge JSON (if applicable)
- Capabilities - Tool calling, streaming support
- Model Name - The API identifier (e.g.,
- Click Save
Configuring Pricing
Accurate pricing enables:
- Cost tracking per execution
- Cost analysis by agent
- Budget monitoring and alerts
- ROI calculations including cache efficiency
Set Model Pricing
- Go to Build > Governance > AI Providers and click on a model to edit
- Enter the Input Price (per million tokens)
- Enter the Output Price (per million tokens)
- Enter the Cached Input Discount (multiplier for cache read tokens)
- Enter the Cache Write Premium (multiplier for cache write tokens)
- Enter Pricing Tiers JSON if the model has a long context surcharge
- Click Save
Cache Read Discount
The Cached Input Discount is a multiplier applied to the input price for cache read tokens. Values less than 1.0 represent a discount:
| Value | Meaning | Effective cost |
|---|---|---|
0.10 | 90% discount | Cache reads cost 10% of standard input price |
0.25 | 75% discount | Cache reads cost 25% of standard input price |
0.50 | 50% discount | Cache reads cost 50% of standard input price |
1.00 | No discount | Cache reads charged at full input rate |
Cache Write Premium
The Cache Write Premium is a multiplier applied to the input price for cache write (cache creation) tokens. Values greater than 1.0 represent a premium:
| Value | Meaning | Effective cost |
|---|---|---|
1.00 | No premium | Cache writes charged at standard input rate |
1.25 | 25% premium | Cache writes cost 125% of standard input price |
Pricing Tiers
The Pricing Tiers field accepts a JSON object that configures long context surcharges. Leave it empty (null) if the model has no surcharge:
{
"longContext": {
"thresholdTokens": 200000,
"inputMultiplier": 2.0,
"outputMultiplier": 1.5
}
}
thresholdTokens- Input token count above which the surcharge activatesinputMultiplier- Input price multiplier when above threshold (2.0 = double price)outputMultiplier- Output price multiplier when above threshold (1.5 = 50% more)
Current Pricing Reference
Prices as of March 2026 (verify with provider):
Anthropic
| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium | Long Context |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 0.10 (90% off) | 1.25 (25% premium) | None (flat rate to 1M) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: $6/$22.50 |
| Claude Opus 4.5 | $5.00 | $25.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: 2x/1.5x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: 2x/1.5x |
| Claude Haiku 4.5 | $1.00 | $5.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: 2x/1.5x |
Most Anthropic models have a long context surcharge at 200,000 tokens (2x input, 1.5x output). The exception is Claude Opus 4.6, which charges a flat rate up to 1M tokens with no surcharge.
OpenAI
| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium | Long Context |
|---|---|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | 0.10 (90% off) | 1.00 (none) | >200k: 2x/1.5x |
| GPT-5 Mini | $0.25 | $2.00 | 0.10 (90% off) | 1.00 (none) | >200k: 2x/1.5x |
| GPT-4o | $2.50 | $10.00 | 0.50 (50% off) | 1.00 (none) | >200k: 2x/1.5x |
| GPT-4o Mini | $0.15 | $0.60 | 0.50 (50% off) | 1.00 (none) | >200k: 2x/1.5x |
OpenAI caching is automatic with no write premium. GPT-5 family models get a 90% cache discount; GPT-4o family gets 50%. Legacy models (GPT-4, GPT-3.5 Turbo) do not have long context surcharges.
xAI (Grok)
| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium |
|---|---|---|---|---|
| Grok 4 | $3.00 | $15.00 | 0.25 (75% off) | 1.00 (none) |
| Grok 4.1 Fast | $0.20 | $0.50 | 0.25 (75% off) | 1.00 (none) |
xAI models do not have a long context surcharge. Caching is automatic.
Google Gemini
| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium | Long Context |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 | 0.10 (90% off) | 1.00 (none) | >200k: $2.50/$15 |
| Gemini 2.5 Flash | $0.30 | $2.50 | 0.10 (90% off) | 1.00 (none) | >200k: 2x/1.5x |
Gemini models have a long context surcharge at 200,000 tokens: 2x input, 1.5x output.
Pricing Sources (verified 2026-03-15):
- Anthropic Claude Pricing | Prompt Caching
- OpenAI API Pricing
- Google Gemini API Pricing
- xAI Models and Pricing
- Azure OpenAI Pricing
Prices change frequently. Always verify current rates with your provider before configuring pricing. The prices above were last verified on March 15, 2026.
Understanding Cache Pricing
What Are Cache Reads and Cache Writes?
LLM providers cache portions of prompts (typically system prompts and tool definitions) to avoid reprocessing identical content on every request. When a request reuses cached content:
- Cache read - The provider retrieves previously processed prompt content from cache. Because the provider already processed this content, it charges a steep discount (typically 75-90% off standard input price).
- Cache write - The first time content is processed and stored in cache. Anthropic charges a small premium for this initial storage (25% above input price). Other providers like OpenAI handle caching automatically with no explicit write charge.
How Different Providers Handle Caching
Provider implementations differ in important ways:
| Provider | Cache writes | Cache reads | Write premium | Read discount |
|---|---|---|---|---|
| Anthropic | Explicit - you can see cache_creation_input_tokens | Explicit - cache_read_input_tokens | 1.25x | 0.10x (90% off) |
| OpenAI / Azure | Automatic - no separate field reported | Reported in cached_tokens (subset of prompt_tokens) | None (1.00x) | 0.50x (50% off) |
| xAI (Grok) | Automatic | Tracked internally | None (1.00x) | 0.25x (75% off) |
| Google Gemini | Explicit via Content Caching API | cached_content_token_count | None (1.00x) | 0.10x (90% off) |
Because providers report tokens differently, Control Bridge applies a provider-aware formula when calculating costs. Anthropic's input_tokens field includes cache write tokens but excludes cache read tokens. OpenAI's prompt_tokens field includes cache read tokens as a subset. The cost engine handles these differences automatically.
Why Cache Pricing Matters for Cost Accuracy
Without accurate cache pricing, cost reporting can be significantly skewed:
- Cache reads overcharged - If cache read tokens are billed at the full input rate instead of the discounted rate, costs appear higher than actual provider billing.
- Cache writes undercharged - If Anthropic cache write tokens are not tracked with the 1.25x premium, costs appear lower than actual billing.
- Net effect varies by workload - For agents with large, stable system prompts that benefit from caching, the reads discount typically outweighs the write premium. Configuring both correctly is essential for accurate cost tracking.
Cost Calculation Formula
Control Bridge uses a multi-dimensional cost formula for each LLM call:
totalCost =
(standardInputTokens / 1M) * inputPrice * inputMultiplier
+ (cacheReadTokens / 1M) * inputPrice * cachedInputDiscount * inputMultiplier
+ (cacheWriteTokens / 1M) * inputPrice * cacheWritePremium * inputMultiplier
+ (outputTokens / 1M) * outputPrice * outputMultiplier
Where:
standardInputTokens- Non-cached, non-write input tokens (provider-specific calculation)cachedInputDiscount- The Cached Input Discount configured on the model (e.g., 0.10)cacheWritePremium- The Cache Write Premium configured on the model (e.g., 1.25)inputMultiplierandoutputMultiplier- 1.0 normally; elevated to 2.0/1.5 when long context surcharge applies
Long Context Surcharges
What Triggers a Surcharge
When the total input tokens in a request exceed 200,000 tokens, most major providers charge a higher rate for that entire request. This threshold applies to the full context window including system prompts, conversation history, and document content.
How the Multipliers Work
When a request crosses the threshold, all input-category costs (standard input, cache reads, and cache writes) are multiplied by the inputMultiplier, and output costs by the outputMultiplier. The most common configuration across providers is:
- Input multiplier: 2.0 (double the standard input rate)
- Output multiplier: 1.5 (50% more than standard output rate)
This applies to the entire request - not just the tokens above the threshold.
Which Providers and Models Have Surcharges
| Provider | Threshold | Input multiplier | Output multiplier | Notes |
|---|---|---|---|---|
| Anthropic | 200,000 tokens | 2.0x | 1.5x | All Claude models |
| OpenAI | 200,000 tokens | 2.0x | 1.5x | Modern models (GPT-4o, GPT-5). Legacy GPT-4 and GPT-3.5 are exempt |
| Azure AI Foundry | 200,000 tokens | 2.0x | 1.5x | Modern models (gpt-4o, gpt-4o-mini, gpt-4-turbo). Legacy gpt-4, gpt-4-32k, gpt-35-turbo are exempt |
| Google Gemini | 200,000 tokens | 2.0x | 1.5x | All Gemini models |
| xAI (Grok) | N/A | N/A | N/A | No long context surcharge |
Long context surcharges are configured via the Pricing Tiers JSON field on each model. Models without a long context surcharge should have this field left empty.
Model Selection
Choosing the Right Model
Consider these factors when selecting models for agents:
Task Complexity
- Simple: Classification, routing, simple Q&A - Haiku / GPT-4o-mini
- Moderate: Customer support, content generation - Sonnet / GPT-4o
- Complex: Analysis, multi-step reasoning - Opus / GPT-4 Turbo
Response Time
- Smaller models respond faster
- Consider user expectations
- Batch processing can use larger models
Cost vs Quality
Calculate expected costs using the full multi-dimensional formula:
Daily cost = (
avg_standard_input_tokens * inputPrice
+ avg_cache_read_tokens * inputPrice * cachedInputDiscount
+ avg_cache_write_tokens * inputPrice * cacheWritePremium
+ avg_output_tokens * outputPrice
) * daily_executions / 1,000,000
For agents with large, stable system prompts, cache reads can significantly reduce the effective input cost. Use the Usage Dashboard's cache token data to understand your actual cache hit rates.
Model Recommendations by Use Case
| Use Case | Recommended Model | Reasoning |
|---|---|---|
| Email triage | Claude Haiku | Fast, cost-effective for classification |
| Customer support | Claude Sonnet | Good balance of quality and cost |
| Technical support | Claude Sonnet | Handles complexity well |
| Executive summaries | Claude Opus | Highest quality output |
| Data extraction | Claude Haiku | Structured tasks do not need large models |
Token Counting
What Are Tokens?
Tokens are the units LLMs use to process text:
- ~4 characters = 1 token (English)
- ~3/4 words = 1 token
- Code and special characters may use more tokens
Token Types
Each LLM call can produce several categories of tokens:
| Token type | Description | Billing |
|---|---|---|
| Input tokens | New, non-cached prompt content | Standard input price |
| Cache read tokens | Prompt content retrieved from provider cache | Discounted input price |
| Cache write tokens | Prompt content written to provider cache for the first time | Premium input price (Anthropic only) |
| Output tokens | Generated response content | Output price |
Estimating Token Usage
For a typical email interaction:
- System prompt: 200-500 tokens
- Email content: 100-1,000 tokens
- Response: 100-500 tokens
Average execution: ~1,000-2,000 total tokens. Agents with large system prompts benefit more from caching because the system prompt content is reused across calls.
Viewing Token Usage
Check token usage in Monitor > Activity > Agent Activity:
- Click on an execution
- View the Metrics section
- See input, output, cache read, and cache write token counts
Aggregate cache token totals by agent and day are available in Manage > Account > Usage.
Cost Tracking
Execution-Level Costs
Each execution records:
- Standard input tokens used
- Cache read tokens (from provider cache)
- Cache write tokens (cache creation, Anthropic only)
- Output tokens generated
- Calculated cost using the full multi-dimensional formula
Aggregate Views
In Manage > Account > Usage, view:
- Total costs by time period
- Costs by provider
- Costs by agent
- Cost trends
- Cache token totals per agent per day
Setting Budgets
While Control Bridge does not enforce budgets automatically, you can:
- Monitor costs in the Usage dashboard
- Set up alerts for unusual spending
- Review high-cost executions
- Adjust model selection for cost optimization
Best Practices
Cost Optimization
- Start with smaller models - Upgrade only if quality is insufficient
- Test model changes - Compare quality before switching
- Monitor outliers - Investigate unexpectedly expensive executions
- Optimize prompts - Shorter prompts mean fewer input tokens
- Leverage caching - Stable system prompts benefit from cache reads at a steep discount
Quality Assurance
- Review sample outputs - Regularly check agent responses
- A/B test models - Compare different models on same inputs
- User feedback - Track response quality ratings
- Adjust as needed - Upgrade models for struggling agents
Pricing Updates
- Check quarterly - Provider pricing changes frequently
- Update immediately - When pricing changes affect billing
- Document changes - Track pricing history
- Recalculate budgets - After significant price changes
Troubleshooting
Model Not Available
Symptoms: Model shows as unavailable or missing
Solutions:
- Verify provider API key has access to the model
- Check if model requires special access (waitlist)
- Confirm model ID is correct
Incorrect Cost Calculations
Symptoms: Costs do not match provider billing
Solutions:
- Verify input and output pricing is configured correctly
- Verify the Cached Input Discount matches your provider's cache read rate (e.g., 0.10 for Anthropic, 0.50 for OpenAI)
- Verify the Cache Write Premium is set correctly (1.25 for Anthropic, 1.00 for OpenAI and others)
- Check whether the model has a long context surcharge configured in Pricing Tiers - missing this will undercount costs for requests over 200,000 tokens
- Compare the token breakdown in Monitor > Activity > Agent Activity with your provider dashboard to confirm token counts match
Model Performance Issues
Symptoms: Slow responses or timeouts
Solutions:
- Check provider status for outages
- Consider switching to a faster model
- Optimize prompt length
- Review concurrent request limits