LLM Models

Configure models with pricing information for accurate cost tracking across all billing dimensions.

Overview

LLM Models represent the specific AI models available from your configured providers. Control Bridge automatically discovers available models, but you can configure pricing information to enable accurate cost tracking.

Modern LLM providers charge for more than just input and output tokens. Cache reads, cache writes, and long context requests each have their own pricing dimensions. Configuring all of these ensures the Usage Dashboard reflects your actual provider billing.

Understanding Models

Model Hierarchy

Provider (e.g., Anthropic)
└── Model (e.g., claude-sonnet-4-6)
    ├── Input pricing (per million tokens)
    ├── Output pricing (per million tokens)
    ├── Cache read discount (fraction of input price)
    ├── Cache write premium (multiplier on input price)
    └── Pricing tiers (long context surcharge configuration)

Model Capabilities

Different models offer different tradeoffs:

| Model Tier | Speed | Intelligence | Cost | Best For |
| --- | --- | --- | --- | --- |
| Small (Haiku, GPT-4o-mini) | Fast | Good | Low | Simple tasks, high volume |
| Medium (Sonnet, GPT-4o) | Moderate | Excellent | Medium | General use, balanced |
| Large (Opus, GPT-5.2) | Slower | Best | High | Complex reasoning |

Context Windows and Output Limits

Context window is the maximum number of tokens a model can process in a single request (input + output combined). Max output is the ceiling for generated response tokens.

| Provider | Model | Context Window (tokens) | Max Output (tokens) |
| --- | --- | --- | --- |
| Anthropic | Claude Opus 4.6 | 1,000,000 | 128,000 |
| Anthropic | Claude Sonnet 4.6 | 1,000,000 | 128,000 |
| Anthropic | Claude Opus/Sonnet 4.5 | 200,000 | 64,000 |
| Anthropic | Claude Haiku 4.5 | 200,000 | 64,000 |
| OpenAI | GPT-5.2 | 400,000 | 128,000 |
| OpenAI | GPT-5 Mini | 400,000 | 128,000 |
| OpenAI | GPT-4o / GPT-4o Mini | 128,000 | 16,384 |
| xAI | Grok 4 | 256,000 | 100,000 |
| xAI | Grok 4.1 Fast | 2,000,000 | 100,000 |
| Google Gemini | Gemini 2.5 Pro / Flash | 1,000,000 | 65,536 |

Context window values were last verified on 2026-03-15.

Viewing Models

Go to Build > Governance > AI Providers and you will see models listed under each provider.

Model Information

For each model, you can see:

  • Model ID - The technical identifier used by the provider
  • Display Name - Human-readable name
  • Input Price - Cost per million input tokens
  • Output Price - Cost per million output tokens
  • Cached Input Discount - Multiplier applied to the input price for cache read tokens
  • Cache Write Premium - Multiplier applied to the input price for cache write tokens
  • Pricing Tiers - Long context surcharge configuration (JSON)
  • Status - Whether the model is available

Adding Models to a Provider

Using Model Templates

When adding a new model to a provider, you can use pre-configured templates that automatically populate model specifications:

  1. Go to Build > Governance > AI Providers
  2. Click on a provider to expand it
  3. Click Add Model
  4. Select a model from the Use a template dropdown
  5. The form auto-fills with:
    • Model identifier
    • Display name
    • Context window size
    • Input/output pricing
    • Cache pricing values
    • Default temperature
    • Tool calling and streaming support
  6. Customize any values if needed
  7. Click Save
Tip: Templates are sourced from the global model database and include up-to-date pricing and capabilities for common models, including cache pricing defaults.

Manual Model Configuration

If no template is available for your model:

  1. Click Add Model on a provider
  2. Enter the model information manually:
    • Model Name - The API identifier (e.g., claude-sonnet-4-6)
    • Display Name - Human-readable name
    • Context Window - Maximum tokens the model can process
    • Pricing - Input and output cost per million tokens
    • Cache Pricing - Cache read discount and write premium
    • Pricing Tiers - Long context surcharge JSON (if applicable)
    • Capabilities - Tool calling, streaming support
  3. Click Save

Configuring Pricing

Accurate pricing enables:

  • Cost tracking per execution
  • Cost analysis by agent
  • Budget monitoring and alerts
  • ROI calculations including cache efficiency

Set Model Pricing

  1. Go to Build > Governance > AI Providers and click on a model to edit
  2. Enter the Input Price (per million tokens)
  3. Enter the Output Price (per million tokens)
  4. Enter the Cached Input Discount (multiplier for cache read tokens)
  5. Enter the Cache Write Premium (multiplier for cache write tokens)
  6. Enter Pricing Tiers JSON if the model has a long context surcharge
  7. Click Save

Cache Read Discount

The Cached Input Discount is a multiplier applied to the input price for cache read tokens. Values less than 1.0 represent a discount:

| Value | Meaning | Effective cost |
| --- | --- | --- |
| 0.10 | 90% discount | Cache reads cost 10% of standard input price |
| 0.25 | 75% discount | Cache reads cost 25% of standard input price |
| 0.50 | 50% discount | Cache reads cost 50% of standard input price |
| 1.00 | No discount | Cache reads charged at full input rate |

Cache Write Premium

The Cache Write Premium is a multiplier applied to the input price for cache write (cache creation) tokens. Values greater than 1.0 represent a premium:

| Value | Meaning | Effective cost |
| --- | --- | --- |
| 1.00 | No premium | Cache writes charged at standard input rate |
| 1.25 | 25% premium | Cache writes cost 125% of standard input price |
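To make the two multipliers concrete, the sketch below converts them into effective per-million prices. The function name `effective_cache_prices` is purely illustrative and is not part of Control Bridge; the sample values are the Claude Sonnet 4.6 figures from the pricing reference on this page.

```python
def effective_cache_prices(input_price: float,
                           cached_input_discount: float,
                           cache_write_premium: float) -> tuple[float, float]:
    """Return (cache_read_price, cache_write_price) in USD per million tokens.

    Both values are derived from the standard input price, mirroring how the
    Cached Input Discount and Cache Write Premium fields are described above.
    """
    cache_read_price = input_price * cached_input_discount
    cache_write_price = input_price * cache_write_premium
    return cache_read_price, cache_write_price


# Sample values: $3.00 input, 0.10 read discount, 1.25 write premium.
read_price, write_price = effective_cache_prices(3.00, 0.10, 1.25)
print(f"cache read:  ${read_price:.2f} per 1M tokens")   # $0.30
print(f"cache write: ${write_price:.2f} per 1M tokens")  # $3.75
```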

Pricing Tiers

The Pricing Tiers field accepts a JSON object that configures long context surcharges. Leave it empty (null) if the model has no surcharge. A model that doubles its input price and raises its output price by 50% above 200,000 input tokens would use:

{
  "longContext": {
    "thresholdTokens": 200000,
    "inputMultiplier": 2.0,
    "outputMultiplier": 1.5
  }
}

  • thresholdTokens - Input token count above which the surcharge activates
  • inputMultiplier - Input price multiplier when above threshold (2.0 = double price)
  • outputMultiplier - Output price multiplier when above threshold (1.5 = 50% more)
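As a rough illustration of how such a configuration could be interpreted, the following sketch resolves the multipliers for a single request. It assumes the field semantics listed above (the surcharge activates strictly above `thresholdTokens`); the helper name `resolve_multipliers` is hypothetical and does not describe Control Bridge's internal code.

```python
import json
from typing import Optional


def resolve_multipliers(pricing_tiers_json: Optional[str],
                        input_tokens: int) -> tuple[float, float]:
    """Return (inputMultiplier, outputMultiplier) for one request.

    An empty or null Pricing Tiers value means no surcharge, so both
    multipliers stay at 1.0.
    """
    if not pricing_tiers_json:
        return 1.0, 1.0
    tiers = json.loads(pricing_tiers_json) or {}
    long_context = tiers.get("longContext")
    if not long_context:
        return 1.0, 1.0
    # Assumption: the surcharge applies only when input tokens exceed the threshold.
    if input_tokens > long_context["thresholdTokens"]:
        return (float(long_context["inputMultiplier"]),
                float(long_context["outputMultiplier"]))
    return 1.0, 1.0


config = ('{"longContext": {"thresholdTokens": 200000, '
          '"inputMultiplier": 2.0, "outputMultiplier": 1.5}}')
print(resolve_multipliers(config, 150_000))  # below threshold
print(resolve_multipliers(config, 250_000))  # above threshold
```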

Current Pricing Reference

Prices as of March 2026 (verify with provider):

Anthropic

| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium | Long Context |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | $5.00 | $25.00 | 0.10 (90% off) | 1.25 (25% premium) | None (flat rate to 1M) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: $6/$22.50 |
| Claude Opus 4.5 | $5.00 | $25.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: 2x/1.5x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: 2x/1.5x |
| Claude Haiku 4.5 | $1.00 | $5.00 | 0.10 (90% off) | 1.25 (25% premium) | >200k: 2x/1.5x |

Most Anthropic models have a long context surcharge at 200,000 tokens (2x input, 1.5x output). The exception is Claude Opus 4.6, which charges a flat rate up to 1M tokens with no surcharge.
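Some long context entries are quoted as absolute prices (for example, $6/$22.50 for Claude Sonnet 4.6) rather than as multipliers. The sketch below shows one way to turn such prices into a Pricing Tiers object by dividing by the base prices. The helper `tiers_from_prices` is an illustrative name, and the 200,000-token default threshold is taken from the table above.

```python
import json


def tiers_from_prices(base_input: float, base_output: float,
                      long_input: float, long_output: float,
                      threshold_tokens: int = 200_000) -> dict:
    """Build a Pricing Tiers object from absolute long context prices."""
    return {
        "longContext": {
            "thresholdTokens": threshold_tokens,
            # Multiplier = long context price / standard price.
            "inputMultiplier": round(long_input / base_input, 4),
            "outputMultiplier": round(long_output / base_output, 4),
        }
    }


# Claude Sonnet 4.6 row: $3.00/$15.00 standard, $6/$22.50 above 200k.
sonnet_tiers = tiers_from_prices(3.00, 15.00, 6.00, 22.50)
print(json.dumps(sonnet_tiers, indent=2))
```

For this row the result is the same 2.0 / 1.5 pair used by the other surcharged Anthropic models.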

OpenAI

| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium | Long Context |
| --- | --- | --- | --- | --- | --- |
| GPT-5.2 | $1.75 | $14.00 | 0.10 (90% off) | 1.00 (none) | >200k: 2x/1.5x |
| GPT-5 Mini | $0.25 | $2.00 | 0.10 (90% off) | 1.00 (none) | >200k: 2x/1.5x |
| GPT-4o | $2.50 | $10.00 | 0.50 (50% off) | 1.00 (none) | >200k: 2x/1.5x |
| GPT-4o Mini | $0.15 | $0.60 | 0.50 (50% off) | 1.00 (none) | >200k: 2x/1.5x |

OpenAI caching is automatic with no write premium. GPT-5 family models get a 90% cache discount; GPT-4o family gets 50%. Legacy models (GPT-4, GPT-3.5 Turbo) do not have long context surcharges.

xAI (Grok)

| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium |
| --- | --- | --- | --- | --- |
| Grok 4 | $3.00 | $15.00 | 0.25 (75% off) | 1.00 (none) |
| Grok 4.1 Fast | $0.20 | $0.50 | 0.25 (75% off) | 1.00 (none) |

xAI models do not have a long context surcharge. Caching is automatic.

Google Gemini

| Model | Input (per 1M) | Output (per 1M) | Cache Read Discount | Cache Write Premium | Long Context |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Pro | $1.25 | $10.00 | 0.10 (90% off) | 1.00 (none) | >200k: $2.50/$15 |
| Gemini 2.5 Flash | $0.30 | $2.50 | 0.10 (90% off) | 1.00 (none) | >200k: 2x/1.5x |

Gemini models have a long context surcharge at 200,000 tokens: 2x input, 1.5x output.

Warning: Prices change frequently. Always verify current rates with your provider before configuring pricing. The prices above were last verified against provider pricing sources on March 15, 2026.

Understanding Cache Pricing

What Are Cache Reads and Cache Writes?

LLM providers cache portions of prompts (typically system prompts and tool definitions) to avoid reprocessing identical content on every request. When a request reuses cached content:

  • Cache read - The provider retrieves previously processed prompt content from cache. Because the provider already processed this content, it charges a steep discount (50-90% off the standard input price, depending on provider and model).
  • Cache write - The first time content is processed and stored in cache. Anthropic charges a small premium for this initial storage (25% above input price). Other providers like OpenAI handle caching automatically with no explicit write charge.

How Different Providers Handle Caching

Provider implementations differ in important ways:

| Provider | Cache writes | Cache reads | Write premium | Read discount |
| --- | --- | --- | --- | --- |
| Anthropic | Explicit: reported in cache_creation_input_tokens | Explicit: reported in cache_read_input_tokens | 1.25x | 0.10x (90% off) |
| OpenAI / Azure | Automatic: no separate field reported | Reported in cached_tokens (subset of prompt_tokens) | None (1.00x) | 0.10x (90% off) for the GPT-5 family; 0.50x (50% off) for the GPT-4o family |
| xAI (Grok) | Automatic | Tracked internally | None (1.00x) | 0.25x (75% off) |
| Google Gemini | Explicit via Content Caching API | Reported in cached_content_token_count | None (1.00x) | 0.10x (90% off) |

Because providers report tokens differently, Control Bridge applies a provider-aware formula when calculating costs. Anthropic's input_tokens field includes cache write tokens but excludes cache read tokens. OpenAI's prompt_tokens field includes cache read tokens as a subset. The cost engine handles these differences automatically.

Why Cache Pricing Matters for Cost Accuracy

Without accurate cache pricing, cost reporting can be significantly skewed:

  • Cache reads overcharged - If cache read tokens are billed at the full input rate instead of the discounted rate, costs appear higher than actual provider billing.
  • Cache writes undercharged - If Anthropic cache write tokens are not tracked with the 1.25x premium, costs appear lower than actual billing.
  • Net effect varies by workload - For agents with large, stable system prompts that benefit from caching, the reads discount typically outweighs the write premium. Configuring both correctly is essential for accurate cost tracking.

Cost Calculation Formula

Control Bridge uses a multi-dimensional cost formula for each LLM call:

totalCost =
    (standardInputTokens / 1M) * inputPrice * inputMultiplier
  + (cacheReadTokens / 1M) * inputPrice * cachedInputDiscount * inputMultiplier
  + (cacheWriteTokens / 1M) * inputPrice * cacheWritePremium * inputMultiplier
  + (outputTokens / 1M) * outputPrice * outputMultiplier

Where:

  • standardInputTokens - Non-cached, non-write input tokens (provider-specific calculation)
  • cachedInputDiscount - The Cached Input Discount configured on the model (e.g., 0.10)
  • cacheWritePremium - The Cache Write Premium configured on the model (e.g., 1.25)
  • inputMultiplier and outputMultiplier - 1.0 normally; elevated to 2.0/1.5 when long context surcharge applies
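The formula can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only: the class and function names (`ModelPricing`, `TokenUsage`, `total_cost`) are hypothetical and do not correspond to Control Bridge's cost engine. The sample token counts are made up, and the prices are the Claude Sonnet 4.6 values from the reference table.

```python
from dataclasses import dataclass


@dataclass
class ModelPricing:
    input_price: float                  # USD per million input tokens
    output_price: float                 # USD per million output tokens
    cached_input_discount: float = 1.0  # multiplier for cache read tokens
    cache_write_premium: float = 1.0    # multiplier for cache write tokens


@dataclass
class TokenUsage:
    standard_input: int  # non-cached, non-write input tokens
    cache_read: int
    cache_write: int
    output: int


def total_cost(pricing: ModelPricing, usage: TokenUsage,
               input_multiplier: float = 1.0,
               output_multiplier: float = 1.0) -> float:
    """Apply the four-term cost formula to a single LLM call."""
    per_million = 1_000_000
    standard = usage.standard_input / per_million * pricing.input_price
    reads = (usage.cache_read / per_million
             * pricing.input_price * pricing.cached_input_discount)
    writes = (usage.cache_write / per_million
              * pricing.input_price * pricing.cache_write_premium)
    output = usage.output / per_million * pricing.output_price
    # All input-category terms share the input multiplier.
    return (standard + reads + writes) * input_multiplier + output * output_multiplier


sonnet = ModelPricing(3.00, 15.00, cached_input_discount=0.10, cache_write_premium=1.25)
call = TokenUsage(standard_input=2_000, cache_read=10_000, cache_write=0, output=500)
cost = total_cost(sonnet, call)
print(f"${cost:.4f}")  # $0.0165
```

In this sample the 10,000 cached tokens cost $0.003; at the full input rate they would have cost $0.03.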

Long Context Surcharges

What Triggers a Surcharge

When the total input tokens in a request exceed 200,000, most major providers charge a higher rate for that entire request. The threshold counts the full input, including system prompts, conversation history, and document content.

How the Multipliers Work

When a request crosses the threshold, all input-category costs (standard input, cache reads, and cache writes) are multiplied by the inputMultiplier, and output costs by the outputMultiplier. The most common configuration across providers is:

  • Input multiplier: 2.0 (double the standard input rate)
  • Output multiplier: 1.5 (50% more than standard output rate)

This applies to the entire request - not just the tokens above the threshold.
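The difference between whole-request and marginal pricing is easy to underestimate. The sketch below contrasts the two for a hypothetical 250,000-token request at a $3.00 input price; both function names are illustrative, and the marginal variant is included only for contrast since it is not how the surcharge described above works.

```python
def whole_request_input_cost(tokens: int, price: float,
                             threshold: int, multiplier: float) -> float:
    """Surcharge as described above: every input token is billed at the higher rate."""
    applied = multiplier if tokens > threshold else 1.0
    return tokens / 1_000_000 * price * applied


def marginal_input_cost(tokens: int, price: float,
                        threshold: int, multiplier: float) -> float:
    """For contrast only: a surcharge limited to the tokens above the threshold."""
    below = min(tokens, threshold)
    above = max(tokens - threshold, 0)
    return (below * price + above * price * multiplier) / 1_000_000


actual = whole_request_input_cost(250_000, 3.00, 200_000, 2.0)
mistaken = marginal_input_cost(250_000, 3.00, 200_000, 2.0)
print(f"whole request: ${actual:.2f}")    # $1.50
print(f"marginal only: ${mistaken:.2f}")  # $0.90
```

In this example, assuming marginal pricing would understate the input cost by 40%.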

Which Providers and Models Have Surcharges

| Provider | Threshold | Input multiplier | Output multiplier | Notes |
| --- | --- | --- | --- | --- |
| Anthropic | 200,000 tokens | 2.0x | 1.5x | All Claude models except Claude Opus 4.6, which charges a flat rate up to 1M tokens |
| OpenAI | 200,000 tokens | 2.0x | 1.5x | Modern models (GPT-4o, GPT-5). Legacy GPT-4 and GPT-3.5 are exempt |
| Azure AI Foundry | 200,000 tokens | 2.0x | 1.5x | Modern models (gpt-4o, gpt-4o-mini, gpt-4-turbo). Legacy gpt-4, gpt-4-32k, gpt-35-turbo are exempt |
| Google Gemini | 200,000 tokens | 2.0x | 1.5x | All Gemini models |
| xAI (Grok) | N/A | N/A | N/A | No long context surcharge |

Info: Long context surcharges are configured via the Pricing Tiers JSON field on each model. Models without a long context surcharge should have this field left empty.

Model Selection

Choosing the Right Model

Consider these factors when selecting models for agents:

Task Complexity

  • Simple: Classification, routing, simple Q&A - Haiku / GPT-4o-mini
  • Moderate: Customer support, content generation - Sonnet / GPT-4o
  • Complex: Analysis, multi-step reasoning - Opus / GPT-5.2

Response Time

  • Smaller models respond faster
  • Consider user expectations
  • Batch processing can use larger models

Cost vs Quality

Estimate expected costs with the multi-dimensional formula. The version below uses per-execution averages and omits the long context multipliers, so apply them separately if your requests exceed the threshold:

Daily cost = (
      avg_standard_input_tokens * inputPrice
    + avg_cache_read_tokens * inputPrice * cachedInputDiscount
    + avg_cache_write_tokens * inputPrice * cacheWritePremium
    + avg_output_tokens * outputPrice
) * daily_executions / 1,000,000

For agents with large, stable system prompts, cache reads can significantly reduce the effective input cost. Use the Usage Dashboard's cache token data to understand your actual cache hit rates.
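A small estimator along these lines can help compare models before switching. The function below is a sketch of the daily cost formula only. The workload numbers (600 standard input, 400 cache read, and 300 output tokens per execution, 1,000 executions per day) are invented for illustration; the prices are the Claude Haiku 4.5 values from the reference table.

```python
def daily_cost(avg_standard_input_tokens: float,
               avg_cache_read_tokens: float,
               avg_cache_write_tokens: float,
               avg_output_tokens: float,
               daily_executions: int,
               input_price: float,
               output_price: float,
               cached_input_discount: float = 1.0,
               cache_write_premium: float = 1.0) -> float:
    """Estimate daily spend in USD; prices are per million tokens."""
    per_execution = (
        avg_standard_input_tokens * input_price
        + avg_cache_read_tokens * input_price * cached_input_discount
        + avg_cache_write_tokens * input_price * cache_write_premium
        + avg_output_tokens * output_price
    )
    return per_execution * daily_executions / 1_000_000


# Hypothetical email-triage workload at $1.00 input / $5.00 output.
estimate = daily_cost(600, 400, 0, 300, daily_executions=1_000,
                      input_price=1.00, output_price=5.00,
                      cached_input_discount=0.10, cache_write_premium=1.25)
print(f"${estimate:.2f} per day")  # $2.14 per day
```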

Model Recommendations by Use Case

| Use Case | Recommended Model | Reasoning |
| --- | --- | --- |
| Email triage | Claude Haiku | Fast, cost-effective for classification |
| Customer support | Claude Sonnet | Good balance of quality and cost |
| Technical support | Claude Sonnet | Handles complexity well |
| Executive summaries | Claude Opus | Highest quality output |
| Data extraction | Claude Haiku | Structured tasks do not need large models |

Token Counting

What Are Tokens?

Tokens are the units LLMs use to process text:

  • 1 token ≈ 4 characters of English text
  • 1 token ≈ 3/4 of a word (about 100 tokens per 75 words)
  • Code and special characters may use more tokens
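For quick budgeting, the 4-characters-per-token rule of thumb can be turned into a rough estimator. This is only a heuristic sketch and not a tokenizer: actual counts depend on the provider's tokenizer and will differ, particularly for code and non-English text. The function name `estimate_tokens` is illustrative.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English prose heuristic)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


system_prompt = "You are an email triage assistant. " * 40
print(estimate_tokens(system_prompt))
```

Use the token counts reported by the provider for anything that affects billing.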

Token Types

Each LLM call can produce several categories of tokens:

| Token type | Description | Billing |
| --- | --- | --- |
| Input tokens | New, non-cached prompt content | Standard input price |
| Cache read tokens | Prompt content retrieved from provider cache | Discounted input price |
| Cache write tokens | Prompt content written to provider cache for the first time | Premium input price (Anthropic only) |
| Output tokens | Generated response content | Output price |

Estimating Token Usage

For a typical email interaction:

  • System prompt: 200-500 tokens
  • Email content: 100-1,000 tokens
  • Response: 100-500 tokens

Average execution: ~1,000-2,000 total tokens. Agents with large system prompts benefit more from caching because the system prompt content is reused across calls.

Viewing Token Usage

Check token usage in Monitor > Activity > Agent Activity:

  1. Click on an execution
  2. View the Metrics section
  3. See input, output, cache read, and cache write token counts

Aggregate cache token totals by agent and day are available in Manage > Account > Usage.

Cost Tracking

Execution-Level Costs

Each execution records:

  • Standard input tokens used
  • Cache read tokens (from provider cache)
  • Cache write tokens (cache creation, Anthropic only)
  • Output tokens generated
  • Calculated cost using the full multi-dimensional formula

Aggregate Views

In Manage > Account > Usage, view:

  • Total costs by time period
  • Costs by provider
  • Costs by agent
  • Cost trends
  • Cache token totals per agent per day

Setting Budgets

While Control Bridge does not enforce budgets automatically, you can:

  1. Monitor costs in the Usage dashboard
  2. Set up alerts for unusual spending
  3. Review high-cost executions
  4. Adjust model selection for cost optimization

Best Practices

Cost Optimization

  1. Start with smaller models - Upgrade only if quality is insufficient
  2. Test model changes - Compare quality before switching
  3. Monitor outliers - Investigate unexpectedly expensive executions
  4. Optimize prompts - Shorter prompts mean fewer input tokens
  5. Leverage caching - Stable system prompts benefit from cache reads at a steep discount

Quality Assurance

  1. Review sample outputs - Regularly check agent responses
  2. A/B test models - Compare different models on same inputs
  3. User feedback - Track response quality ratings
  4. Adjust as needed - Upgrade models for struggling agents

Pricing Updates

  1. Check quarterly - Provider pricing changes frequently
  2. Update immediately - When pricing changes affect billing
  3. Document changes - Track pricing history
  4. Recalculate budgets - After significant price changes

Troubleshooting

Model Not Available

Symptoms: Model shows as unavailable or missing

Solutions:

  1. Verify provider API key has access to the model
  2. Check if model requires special access (waitlist)
  3. Confirm model ID is correct

Incorrect Cost Calculations

Symptoms: Costs do not match provider billing

Solutions:

  1. Verify input and output pricing is configured correctly
  2. Verify the Cached Input Discount matches your provider's cache read rate (e.g., 0.10 for Anthropic and the GPT-5 family, 0.50 for the GPT-4o family, 0.25 for xAI)
  3. Verify the Cache Write Premium is set correctly (1.25 for Anthropic, 1.00 for OpenAI and others)
  4. Check whether the model has a long context surcharge configured in Pricing Tiers. If it is missing, costs will be undercounted for requests over 200,000 input tokens
  5. Compare the token breakdown in Monitor > Activity > Agent Activity with your provider dashboard to confirm token counts match

Model Performance Issues

Symptoms: Slow responses or timeouts

Solutions:

  1. Check provider status for outages
  2. Consider switching to a faster model
  3. Optimize prompt length
  4. Review concurrent request limits