LLM Models

Configure models with pricing information for accurate cost tracking across all billing dimensions.

Overview

LLM Models represent the specific AI models available from your configured providers. Control Bridge automatically discovers available models, but you can configure pricing information to enable accurate cost tracking.

Modern LLM providers charge for more than just input and output tokens. Cache reads, cache writes, and long context requests each have their own pricing dimensions. Configuring all of these ensures the Usage Dashboard reflects your actual provider billing.

Understanding Models

Model Hierarchy

Provider (e.g., Anthropic)
└── Model (e.g., claude-sonnet-4-6)
    ├── Input pricing (per million tokens)
    ├── Output pricing (per million tokens)
    ├── Cache read discount (fraction of input price)
    ├── Cache write premium (multiplier on input price)
    └── Pricing tiers (long context surcharge configuration)

Model Capabilities

Different models offer different tradeoffs:

Model Tier	Speed	Intelligence	Cost	Best For
Small (Haiku, GPT-4o-mini)	Fast	Good	Low	Simple tasks, high volume
Medium (Sonnet, GPT-4o)	Moderate	Excellent	Medium	General use, balanced
Large (Opus, GPT-5.2)	Slower	Best	High	Complex reasoning

Context Windows and Output Limits

Context window is the maximum number of tokens a model can process in a single request (input + output combined). Max output is the ceiling for generated response tokens.

Provider	Model	Context Window	Max Output
Anthropic	Claude Opus 4.6	1,000,000	128,000
Anthropic	Claude Sonnet 4.6	1,000,000	128,000
Anthropic	Claude Opus/Sonnet 4.5	200,000	64,000
Anthropic	Claude Haiku 4.5	200,000	64,000
OpenAI	GPT-5.2	400,000	128,000
OpenAI	GPT-5 Mini	400,000	128,000
OpenAI	GPT-4o / GPT-4o Mini	128,000	16,384
xAI	Grok 4	256,000	100,000
xAI	Grok 4.1 Fast	2,000,000	100,000
Gemini	Gemini 2.5 Pro / Flash	1,000,000	65,536

Context window figures were last verified on 2026-03-15.

Viewing Models

Navigate to Models

Go to Build > Governance > AI Providers. Models are listed under each provider.

Model Information

For each model, you can see:

Model ID - The technical identifier used by the provider
Display Name - Human-readable name
Input Price - Cost per million input tokens
Output Price - Cost per million output tokens
Cached Input Discount - Multiplier applied to cache read tokens
Cache Write Premium - Multiplier applied to cache write tokens
Pricing Tiers - Long context surcharge configuration (JSON)
Status - Whether the model is available

Adding Models to a Provider

Using Model Templates

When adding a new model to a provider, you can use pre-configured templates that automatically populate model specifications:

Go to Build > Governance > AI Providers
Click on a provider to expand it
Click Add Model
Select a model from the Use a template dropdown
The form auto-fills with:
- Model identifier
- Display name
- Context window size
- Input/output pricing
- Cache pricing values
- Default temperature
- Tool calling and streaming support
Customize any values if needed
Click Save

tip

Templates are sourced from the global model database and include up-to-date pricing and capabilities for common models, including cache pricing defaults.

Manual Model Configuration

If no template is available for your model:

Click Add Model on a provider
Enter the model information manually:
- Model Name - The API identifier (e.g., claude-sonnet-4-6)
- Display Name - Human-readable name
- Context Window - Maximum tokens the model can process
- Pricing - Input and output cost per million tokens
- Cache Pricing - Cache read discount and write premium
- Pricing Tiers - Long context surcharge JSON (if applicable)
- Capabilities - Tool calling, streaming support
Click Save

Configuring Pricing

Accurate pricing enables:

Cost tracking per execution
Cost analysis by agent
Budget monitoring and alerts
ROI calculations including cache efficiency

Set Model Pricing

Go to Build > Governance > AI Providers and click on a model to edit
Enter the Input Price (per million tokens)
Enter the Output Price (per million tokens)
Enter the Cached Input Discount (multiplier for cache read tokens)
Enter the Cache Write Premium (multiplier for cache write tokens)
Enter Pricing Tiers JSON if the model has a long context surcharge
Click Save

Cache Read Discount

The Cached Input Discount is a multiplier applied to the input price for cache read tokens. Values less than 1.0 represent a discount:

Value	Meaning	Effective cost
`0.10`	90% discount	Cache reads cost 10% of standard input price
`0.25`	75% discount	Cache reads cost 25% of standard input price
`0.50`	50% discount	Cache reads cost 50% of standard input price
`1.00`	No discount	Cache reads charged at full input rate

Cache Write Premium

The Cache Write Premium is a multiplier applied to the input price for cache write (cache creation) tokens. Values greater than 1.0 represent a premium:

Value	Meaning	Effective cost
`1.00`	No premium	Cache writes charged at standard input rate
`1.25`	25% premium	Cache writes cost 125% of standard input price
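As a quick check on the two multipliers, the sketch below converts them into effective per-million-token prices. The function name `effective_cache_prices` is hypothetical and is not part of Control Bridge; the input values are taken from the Claude Sonnet 4.6 row of the pricing reference.

```python
def effective_cache_prices(
    input_price: float,
    cached_input_discount: float,
    cache_write_premium: float,
) -> dict[str, float]:
    """Return per-million-token prices for cache reads and cache writes."""
    return {
        # Cache reads are billed at a fraction of the input price.
        "cache_read": round(input_price * cached_input_discount, 6),
        # Cache writes are billed at a multiple of the input price.
        "cache_write": round(input_price * cache_write_premium, 6),
    }


prices = effective_cache_prices(input_price=3.00, cached_input_discount=0.10, cache_write_premium=1.25)
print(prices)  # {'cache_read': 0.3, 'cache_write': 3.75}
```

With a $3.00 input price, cache reads cost $0.30 per million tokens and cache writes cost $3.75 per million tokens.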

Pricing Tiers

The Pricing Tiers field accepts a JSON object that configures long context surcharges. Leave it empty (null) if the model has no surcharge:

{
  "longContext": {
    "thresholdTokens": 200000,
    "inputMultiplier": 2.0,
    "outputMultiplier": 1.5
  }
}

thresholdTokens - Input token count above which the surcharge activates
inputMultiplier - Input price multiplier when above threshold (2.0 = double price)
outputMultiplier - Output price multiplier when above threshold (1.5 = 50% more)
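The following sketch shows one way the Pricing Tiers JSON could be interpreted for a single request. It is an illustration only: the function name `long_context_multipliers` is an assumption, and the actual cost engine may parse the field differently. An empty or null field yields multipliers of 1.0.

```python
import json
from typing import Optional


def long_context_multipliers(
    pricing_tiers_json: Optional[str], input_tokens: int
) -> tuple[float, float]:
    """Return (inputMultiplier, outputMultiplier) for one request."""
    if not pricing_tiers_json:
        return 1.0, 1.0  # field left empty: no surcharge
    tiers = json.loads(pricing_tiers_json) or {}  # the JSON literal null parses to None
    long_context = tiers.get("longContext")
    if not long_context:
        return 1.0, 1.0
    # The surcharge activates only when input tokens exceed the threshold.
    if input_tokens > long_context["thresholdTokens"]:
        return long_context["inputMultiplier"], long_context["outputMultiplier"]
    return 1.0, 1.0


tiers_json = """
{
  "longContext": {
    "thresholdTokens": 200000,
    "inputMultiplier": 2.0,
    "outputMultiplier": 1.5
  }
}
"""
print(long_context_multipliers(tiers_json, 150_000))  # (1.0, 1.0)
print(long_context_multipliers(tiers_json, 250_000))  # (2.0, 1.5)
```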

Current Pricing Reference

Prices as of March 2026 (verify with provider):

Anthropic

Model	Input (per 1M)	Output (per 1M)	Cache Read Discount	Cache Write Premium	Long Context
Claude Opus 4.6	$5.00	$25.00	0.10 (90% off)	1.25 (25% premium)	None (flat rate to 1M)
Claude Sonnet 4.6	$3.00	$15.00	0.10 (90% off)	1.25 (25% premium)	`>`200k: $6/$22.50
Claude Opus 4.5	$5.00	$25.00	0.10 (90% off)	1.25 (25% premium)	`>`200k: 2x/1.5x
Claude Sonnet 4.5	$3.00	$15.00	0.10 (90% off)	1.25 (25% premium)	`>`200k: 2x/1.5x
Claude Haiku 4.5	$1.00	$5.00	0.10 (90% off)	1.25 (25% premium)	`>`200k: 2x/1.5x

Most Anthropic models have a long context surcharge at 200,000 tokens (2x input, 1.5x output). The exception is Claude Opus 4.6, which charges a flat rate up to 1M tokens with no surcharge.

OpenAI

Model	Input (per 1M)	Output (per 1M)	Cache Read Discount	Cache Write Premium	Long Context
GPT-5.2	$1.75	$14.00	0.10 (90% off)	1.00 (none)	`>`200k: 2x/1.5x
GPT-5 Mini	$0.25	$2.00	0.10 (90% off)	1.00 (none)	`>`200k: 2x/1.5x
GPT-4o	$2.50	$10.00	0.50 (50% off)	1.00 (none)	`>`200k: 2x/1.5x
GPT-4o Mini	$0.15	$0.60	0.50 (50% off)	1.00 (none)	`>`200k: 2x/1.5x

OpenAI caching is automatic with no write premium. GPT-5 family models get a 90% cache discount; GPT-4o family gets 50%. Legacy models (GPT-4, GPT-3.5 Turbo) do not have long context surcharges.

xAI (Grok)

Model	Input (per 1M)	Output (per 1M)	Cache Read Discount	Cache Write Premium
Grok 4	$3.00	$15.00	0.25 (75% off)	1.00 (none)
Grok 4.1 Fast	$0.20	$0.50	0.25 (75% off)	1.00 (none)

xAI models do not have a long context surcharge. Caching is automatic.

Google Gemini

Model	Input (per 1M)	Output (per 1M)	Cache Read Discount	Cache Write Premium	Long Context
Gemini 2.5 Pro	$1.25	$10.00	0.10 (90% off)	1.00 (none)	`>`200k: $2.50/$15
Gemini 2.5 Flash	$0.30	$2.50	0.10 (90% off)	1.00 (none)	`>`200k: 2x/1.5x

Gemini models have a long context surcharge at 200,000 tokens: 2x input, 1.5x output.

Pricing Sources (verified 2026-03-15):

warning

Prices change frequently. Always verify current rates with your provider before configuring pricing. The prices above were last verified on March 15, 2026.

Understanding Cache Pricing

What Are Cache Reads and Cache Writes?

LLM providers cache portions of prompts (typically system prompts and tool definitions) to avoid reprocessing identical content on every request. This produces two additional token categories:

Cache read - The provider retrieves previously processed prompt content from cache. Because the content has already been processed, the provider bills it at a steep discount (50-90% off the standard input price, depending on the provider and model).
Cache write - The provider processes content for the first time and stores it in cache. Anthropic charges a small premium for this initial storage (25% above the input price). Other providers, such as OpenAI, handle caching automatically with no explicit write charge.

How Different Providers Handle Caching

Provider implementations differ in important ways:

Provider	Cache writes	Cache reads	Write premium	Read discount
Anthropic	Explicit - you can see `cache_creation_input_tokens`	Explicit - `cache_read_input_tokens`	1.25x	0.10x (90% off)
OpenAI / Azure	Automatic - no separate field reported	Reported in `cached_tokens` (subset of prompt_tokens)	None (1.00x)	0.10x-0.50x (90% off for GPT-5 family, 50% off for GPT-4o family)
xAI (Grok)	Automatic	Tracked internally	None (1.00x)	0.25x (75% off)
Google Gemini	Explicit via Content Caching API	`cached_content_token_count`	None (1.00x)	0.10x (90% off)

Because providers report tokens differently, Control Bridge applies a provider-aware formula when calculating costs. Anthropic reports cache write tokens (cache_creation_input_tokens) and cache read tokens (cache_read_input_tokens) separately from input_tokens, whereas OpenAI's prompt_tokens field includes cache read tokens as a subset. The cost engine handles these differences automatically.
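To illustrate the subset case, the sketch below splits an OpenAI-style usage object into standard input and cache read tokens. The dictionary mirrors the `prompt_tokens` and `prompt_tokens_details.cached_tokens` fields of the OpenAI usage payload; the function name and the returned keys are hypothetical, and this is not the Control Bridge implementation.

```python
def split_openai_prompt_tokens(usage: dict) -> dict[str, int]:
    """Separate cached tokens from an OpenAI-style prompt token count."""
    prompt_tokens = usage["prompt_tokens"]
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return {
        # cached_tokens is a subset of prompt_tokens, so subtract it
        # to avoid billing the same tokens twice.
        "standard_input": prompt_tokens - cached,
        "cache_read": cached,
        # No separate cache write count is reported for automatic caching.
        "cache_write": 0,
    }


usage = {"prompt_tokens": 12_000, "prompt_tokens_details": {"cached_tokens": 9_000}}
print(split_openai_prompt_tokens(usage))
# {'standard_input': 3000, 'cache_read': 9000, 'cache_write': 0}
```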

Why Cache Pricing Matters for Cost Accuracy

Without accurate cache pricing, cost reporting can be significantly skewed:

Cache reads overcharged - If cache read tokens are billed at the full input rate instead of the discounted rate, costs appear higher than actual provider billing.
Cache writes undercharged - If Anthropic cache write tokens are not tracked with the 1.25x premium, costs appear lower than actual billing.
Net effect varies by workload - For agents with large, stable system prompts that benefit from caching, the read discount typically outweighs the write premium. Configuring both correctly is essential for accurate cost tracking.
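A rough way to see the net effect is to compare the cost of a stable prompt prefix with and without caching. The sketch below uses the Anthropic multipliers (1.25 write, 0.10 read) and assumes the prefix is written once and every later call is a cache hit, which will not always hold in practice. The function name is hypothetical.

```python
def prefix_cost_units(
    calls: int,
    cached_input_discount: float = 0.10,
    cache_write_premium: float = 1.25,
) -> tuple[float, float]:
    """Cost of a stable prompt prefix over `calls` requests.

    Costs are expressed in units of one uncached pass over the prefix.
    Returns (cost_without_cache, cost_with_cache).
    """
    without_cache = float(calls)
    # One cache write on the first call, then cache reads on every later call.
    with_cache = cache_write_premium + (calls - 1) * cached_input_discount
    return without_cache, with_cache


for calls in (1, 2, 10):
    without_cache, with_cache = prefix_cost_units(calls)
    print(calls, round(without_cache, 2), round(with_cache, 2))
```

Under these assumptions a single call costs more with caching (1.25 versus 1.0 units), but caching is already cheaper by the second call (1.35 versus 2.0 units).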

Cost Calculation Formula

Control Bridge uses a multi-dimensional cost formula for each LLM call:

totalCost =
  (standardInputTokens / 1M) * inputPrice * inputMultiplier
  + (cacheReadTokens / 1M) * inputPrice * cachedInputDiscount * inputMultiplier
  + (cacheWriteTokens / 1M) * inputPrice * cacheWritePremium * inputMultiplier
  + (outputTokens / 1M) * outputPrice * outputMultiplier

Where:

standardInputTokens - Non-cached, non-write input tokens (provider-specific calculation)
cachedInputDiscount - The Cached Input Discount configured on the model (e.g., 0.10)
cacheWritePremium - The Cache Write Premium configured on the model (e.g., 1.25)
inputMultiplier and outputMultiplier - 1.0 normally; set to the values from Pricing Tiers (commonly 2.0 and 1.5) when the long context surcharge applies
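The formula translates directly into code. The sketch below is a transcription for illustration, not the Control Bridge source; the function and parameter names are assumptions. The example values come from the Claude Sonnet 4.6 row of the pricing reference, with no long context surcharge applied.

```python
def total_cost(
    standard_input_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
    cached_input_discount: float = 1.0,
    cache_write_premium: float = 1.0,
    input_multiplier: float = 1.0,
    output_multiplier: float = 1.0,
) -> float:
    """Cost in dollars for one LLM call; prices are per million tokens."""
    million = 1_000_000
    standard = standard_input_tokens / million * input_price * input_multiplier
    reads = cache_read_tokens / million * input_price * cached_input_discount * input_multiplier
    writes = cache_write_tokens / million * input_price * cache_write_premium * input_multiplier
    output = output_tokens / million * output_price * output_multiplier
    return standard + reads + writes + output


cost = total_cost(
    standard_input_tokens=10_000,
    cache_read_tokens=50_000,
    cache_write_tokens=0,
    output_tokens=2_000,
    input_price=3.00,
    output_price=15.00,
    cached_input_discount=0.10,
    cache_write_premium=1.25,
)
print(f"${cost:.4f}")  # $0.0750
```

The call breaks down as $0.03 for standard input, $0.015 for cache reads, and $0.03 for output.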

Long Context Surcharges

What Triggers a Surcharge

When the total input tokens in a request exceed 200,000, most major providers charge a higher rate for that entire request. The threshold is measured against the full input, including system prompts, conversation history, and document content.

How the Multipliers Work

When a request crosses the threshold, all input-category costs (standard input, cache reads, and cache writes) are multiplied by the inputMultiplier, and output costs by the outputMultiplier. The most common configuration across providers is:

Input multiplier: 2.0 (double the standard input rate)
Output multiplier: 1.5 (50% more than standard output rate)

This applies to the entire request - not just the tokens above the threshold.
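Because the multiplier covers the whole request, cost jumps sharply at the threshold. The sketch below compares input cost at and just above 200,000 tokens for a $3.00 input price. The function name and default arguments are illustrative assumptions.

```python
def input_cost(
    input_tokens: int,
    input_price: float,
    threshold_tokens: int = 200_000,
    input_multiplier: float = 2.0,
) -> float:
    """Input cost in dollars, with the surcharge applied to every token."""
    multiplier = input_multiplier if input_tokens > threshold_tokens else 1.0
    return input_tokens / 1_000_000 * input_price * multiplier


at_threshold = input_cost(200_000, 3.00)
just_over = input_cost(200_001, 3.00)
print(f"{at_threshold:.2f}")  # 0.60
print(f"{just_over:.2f}")     # 1.20
```

One extra token roughly doubles the input cost of the request, so prompts that hover near the threshold are worth trimming.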

Which Providers and Models Have Surcharges

Provider	Threshold	Input multiplier	Output multiplier	Notes
Anthropic	200,000 tokens	2.0x	1.5x	All Claude models except Claude Opus 4.6 (flat rate to 1M)
OpenAI	200,000 tokens	2.0x	1.5x	Modern models (GPT-4o, GPT-5). Legacy GPT-4 and GPT-3.5 are exempt
Azure AI Foundry	200,000 tokens	2.0x	1.5x	Modern models (gpt-4o, gpt-4o-mini, gpt-4-turbo). Legacy gpt-4, gpt-4-32k, gpt-35-turbo are exempt
Google Gemini	200,000 tokens	2.0x	1.5x	All Gemini models
xAI (Grok)	N/A	N/A	N/A	No long context surcharge

info

Long context surcharges are configured via the Pricing Tiers JSON field on each model. Models without a long context surcharge should have this field left empty.

Model Selection

Choosing the Right Model

Consider these factors when selecting models for agents:

Task Complexity

Simple: Classification, routing, simple Q&A - Haiku / GPT-4o-mini
Moderate: Customer support, content generation - Sonnet / GPT-4o
Complex: Analysis, multi-step reasoning - Opus / GPT-5.2

Response Time

Smaller models respond faster
Consider user expectations
Batch processing can use larger models

Cost vs Quality

Calculate expected costs using the full multi-dimensional formula:

Daily cost = (
  avg_standard_input_tokens * inputPrice
  + avg_cache_read_tokens * inputPrice * cachedInputDiscount
  + avg_cache_write_tokens * inputPrice * cacheWritePremium
  + avg_output_tokens * outputPrice
) * daily_executions / 1,000,000

For agents with large, stable system prompts, cache reads can significantly reduce the effective input cost. Use the Usage Dashboard's cache token data to understand your actual cache hit rates.
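The daily cost estimate can be sketched as a small function. The names below are hypothetical, and the example workload (token averages and 10,000 executions per day) is invented for illustration; the prices are the Claude Haiku 4.5 values from the pricing reference. Long context multipliers are left out, as in the formula above.

```python
def daily_cost(
    avg_standard_input_tokens: float,
    avg_cache_read_tokens: float,
    avg_cache_write_tokens: float,
    avg_output_tokens: float,
    input_price: float,
    output_price: float,
    cached_input_discount: float,
    cache_write_premium: float,
    daily_executions: int,
) -> float:
    """Estimated daily cost in dollars; prices are per million tokens."""
    per_execution = (
        avg_standard_input_tokens * input_price
        + avg_cache_read_tokens * input_price * cached_input_discount
        + avg_cache_write_tokens * input_price * cache_write_premium
        + avg_output_tokens * output_price
    )
    return per_execution * daily_executions / 1_000_000


estimate = daily_cost(
    avg_standard_input_tokens=500,
    avg_cache_read_tokens=2_000,
    avg_cache_write_tokens=0,
    avg_output_tokens=300,
    input_price=1.00,
    output_price=5.00,
    cached_input_discount=0.10,
    cache_write_premium=1.25,
    daily_executions=10_000,
)
print(f"${estimate:.2f}")  # $22.00
```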

Model Recommendations by Use Case

Use Case	Recommended Model	Reasoning
Email triage	Claude Haiku	Fast, cost-effective for classification
Customer support	Claude Sonnet	Good balance of quality and cost
Technical support	Claude Sonnet	Handles complexity well
Executive summaries	Claude Opus	Highest quality output
Data extraction	Claude Haiku	Structured tasks do not need large models

Token Counting

What Are Tokens?

Tokens are the units LLMs use to process text:

~4 characters = 1 token (English)
~3/4 of a word = 1 token (roughly 75 words per 100 tokens)
Code and special characters may use more tokens
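For rough planning, the 4-characters-per-token rule of thumb can be applied directly. This is only an approximation for English text, not a real tokenizer, and the function name is hypothetical; actual counts vary by model and content.

```python
import math


def estimate_tokens(text: str) -> int:
    """Approximate token count using ~4 characters per token (English)."""
    return math.ceil(len(text) / 4)


sample = "Please summarize the attached invoice and flag any overdue items."
print(estimate_tokens(sample))
```

Use the token counts reported by the provider for billing; this estimate is only suitable for sizing prompts ahead of time.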

Token Types

Each LLM call can produce several categories of tokens:

Token type	Description	Billing
Input tokens	New, non-cached prompt content	Standard input price
Cache read tokens	Prompt content retrieved from provider cache	Discounted input price
Cache write tokens	Prompt content written to provider cache for the first time	Premium input price (Anthropic only)
Output tokens	Generated response content	Output price

Estimating Token Usage

For a typical email interaction:

System prompt: 200-500 tokens
Email content: 100-1,000 tokens
Response: 100-500 tokens

Average execution: ~1,000-2,000 total tokens. Agents with large system prompts benefit more from caching because the system prompt content is reused across calls.

Viewing Token Usage

Check token usage in Monitor > Activity > Agent Activity:

Click on an execution
View the Metrics section
See input, output, cache read, and cache write token counts

Aggregate cache token totals by agent and day are available in Manage > Account > Usage.

Cost Tracking

Execution-Level Costs

Each execution records:

Standard input tokens used
Cache read tokens (from provider cache)
Cache write tokens (cache creation, Anthropic only)
Output tokens generated
Calculated cost using the full multi-dimensional formula

Aggregate Views

In Manage > Account > Usage, view:

Total costs by time period
Costs by provider
Costs by agent
Cost trends
Cache token totals per agent per day

Setting Budgets

While Control Bridge does not enforce budgets automatically, you can:

Monitor costs in the Usage dashboard
Set up alerts for unusual spending
Review high-cost executions
Adjust model selection for cost optimization

Best Practices

Cost Optimization

Start with smaller models - Upgrade only if quality is insufficient
Test model changes - Compare quality before switching
Monitor outliers - Investigate unexpectedly expensive executions
Optimize prompts - Shorter prompts mean fewer input tokens
Leverage caching - Stable system prompts benefit from cache reads at a steep discount

Quality Assurance

Review sample outputs - Regularly check agent responses
A/B test models - Compare different models on same inputs
User feedback - Track response quality ratings
Adjust as needed - Upgrade models for struggling agents

Pricing Updates

Check quarterly - Provider pricing changes frequently
Update immediately - When pricing changes affect billing
Document changes - Track pricing history
Recalculate budgets - After significant price changes

Troubleshooting

Model Not Available

Symptoms: Model shows as unavailable or missing

Solutions:

Verify provider API key has access to the model
Check if model requires special access (waitlist)
Confirm model ID is correct

Incorrect Cost Calculations

Symptoms: Costs do not match provider billing

Solutions:

Verify input and output pricing is configured correctly
Verify the Cached Input Discount matches your provider's cache read rate (e.g., 0.10 for Anthropic and the GPT-5 family, 0.50 for the GPT-4o family)
Verify the Cache Write Premium is set correctly (1.25 for Anthropic, 1.00 for OpenAI and others)
Check whether the model has a long context surcharge configured in Pricing Tiers - missing this will undercount costs for requests over 200,000 tokens
Compare the token breakdown in Monitor > Activity > Agent Activity with your provider dashboard to confirm token counts match
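Some of these checks can be automated before comparing against provider billing. The sketch below flags values that fall outside the ranges described in this guide, for example a discount entered as a percentage (90) instead of a multiplier (0.10). The function name and messages are assumptions for illustration and are not a Control Bridge feature.

```python
def pricing_warnings(
    input_price: float,
    output_price: float,
    cached_input_discount: float,
    cache_write_premium: float,
) -> list[str]:
    """Return human-readable warnings for suspicious pricing values."""
    warnings = []
    if input_price <= 0 or output_price <= 0:
        warnings.append("Input and output prices should be positive.")
    # A discount is a multiplier between 0 (exclusive) and 1.0 (no discount).
    if not 0 < cached_input_discount <= 1.0:
        warnings.append("Cached Input Discount should be a multiplier in (0, 1.0].")
    # A premium is a multiplier of 1.0 (no premium) or more.
    if cache_write_premium < 1.0:
        warnings.append("Cache Write Premium should be 1.0 or greater.")
    return warnings


print(pricing_warnings(3.00, 15.00, 0.10, 1.25))  # []
print(pricing_warnings(3.00, 15.00, 90, 1.25))
```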

Model Performance Issues

Symptoms: Slow responses or timeouts

Solutions:

Check provider status for outages
Consider switching to a faster model
Optimize prompt length
Review concurrent request limits
