Usage Dashboard

Comprehensive cost, performance, and health analytics for all AI agent activity.

Overview

The Usage Dashboard provides deep visibility into how your AI agents are performing, what they're costing, and where potential issues exist. Instead of analyzing individual executions one at a time, the dashboard aggregates data across all agent activity to reveal patterns, trends, and opportunities for optimization.

Use the dashboard to:

  • Track spending across agents, models, and operations
  • Monitor execution health and catch degrading performance early
  • Understand which tools and models are most used
  • Compare Code Mode vs standard execution metrics
  • Analyze costs per mailbox for capacity planning
  • Inspect pre-execution classifier shadow analytics (divergence rate, estimated savings, latency)

The dashboard surfaces insights that would be difficult to spot in raw execution logs, helping you make data-driven decisions about agent configuration, model selection, and resource allocation.

Dashboard Tabs

The Usage Dashboard is organized into the following tabs:

Tab | Purpose
Overview | Cost, health, model usage, and mailbox analytics for all agent activity
Classifier (Shadow) | Pre-execution classifier telemetry while operating in shadow mode
Variants | Per-variant CCS hit rates, nightly generation cost, and cost guardrail status

Use the tab bar at the top of the page to switch between views. The period selector applies to all tabs.

Accessing the Usage Dashboard

Navigate to Manage > Account > Usage to view the dashboard.

Period Selection

All dashboard sections share a unified period selector at the top of the page:

Period | Description
7 Days | Most recent week of activity (default)
30 Days | Last month for trend analysis
90 Days | Quarterly view for long-term patterns

Changing the period updates all sections simultaneously. Use shorter periods for recent debugging and longer periods for strategic planning.

Overview Tab Sections

1. Daily Cost Trend

Track spending patterns over time with visual correlation between costs and execution volume.

Components:

  • Stacked Area Chart — Shows cost by source type (agents, scans, AICOS) over time
  • Daily Execution Volume Bars — Overlay showing execution count per day
  • KPI Summary Cards:
    • Total Cost — Sum of all costs in the selected period
    • Total Executions — Count of all agent runs
    • Avg Daily Cost — Daily spending average
    • Avg Cost/Execution — Per-execution cost average
  • Token Usage Indicators — Request tokens, response tokens, cache read tokens, cache write tokens, and average tokens per execution
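
The KPI cards reduce to a few aggregates over per-day totals. The sketch below is illustrative only: the `DailyUsage` record and its field names are hypothetical, and dividing by the number of days in the list (rather than by days with activity) is an assumption about how Avg Daily Cost is defined.

```python
from dataclasses import dataclass


@dataclass
class DailyUsage:
    # Hypothetical record shape; the dashboard's real schema is not documented here.
    day: str
    cost: float
    executions: int


def summarize(days: list[DailyUsage]) -> dict[str, float]:
    """Compute the four KPI card values for a period."""
    total_cost = sum(d.cost for d in days)
    total_executions = sum(d.executions for d in days)
    return {
        "total_cost": total_cost,
        "total_executions": total_executions,
        # Assumption: average over every day in the list, active or not.
        "avg_daily_cost": total_cost / len(days) if days else 0.0,
        "avg_cost_per_execution": (
            total_cost / total_executions if total_executions else 0.0
        ),
    }


# Two days with the same volume but double the cost on the second day.
sample = [DailyUsage("2025-01-06", 12.0, 40), DailyUsage("2025-01-07", 24.0, 40)]
kpis = summarize(sample)
```

In this sample the execution count is flat while cost doubles, which is the pattern that points at more expensive operations rather than more activity.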

What to look for:

  • Spikes in cost without corresponding execution increases (indicates more expensive operations)
  • Days with no activity (may indicate configuration issues)
  • Steady growth trends (for capacity planning)

Example insights:

  • "Cost doubled on Tuesday despite similar execution volume" → Investigate if agents switched to more expensive models or used more tools
  • "Weekend costs are near zero" → Adjust scheduled tasks or mailbox scan frequency for off-hours

2. Per-Agent Cost Breakdown

Understand which agents drive the most spending and their performance characteristics.

Components:

  • Agent Cost Ranking Table with columns:
    • Agent Name — Display name of the agent
    • Executions — Total runs in the period
    • LLM Calls — Number of model API calls made
    • Tool Calls — Number of tool invocations
    • Avg Duration — Average execution time
    • Success Rate — Percentage of successful completions
    • Tokens — Total tokens consumed
    • Cost — Total spending for this agent
  • Horizontal Bar Chart — Visual cost distribution across agents
  • Knowledge Source Scans — Separate summary row for mailbox scanning operations

What to look for:

  • High-cost agents with low execution counts (expensive per-use cases)
  • Low success rates (may need instruction improvements)
  • Long average durations (performance optimization opportunities)

Example insights:

  • "Sales Assistant has 95% success rate but Finance Agent is at 65%" → Review Finance Agent instructions and tool availability
  • "Customer Support Agent accounts for 60% of costs" → Consider using a smaller model or optimizing prompts

Note on success rate: Escalations are treated as successful outcomes since the agent correctly identified the need for human input.

3. Execution Health & Error Analysis

Monitor system health with composite scoring and detailed error categorization.

Components:

  • Composite Health Score (0-100) — Weighted calculation:
    • 50%: Completion rate (successful executions / total)
    • 30%: Tool success rate (successful tool calls / total)
    • 20%: Error-free rate (executions without errors / total)
  • Status Distribution Donut Chart — Breakdown by completed, failed, escalated, in_progress
  • Duration Metrics with Histogram:
    • Average duration
    • Median duration
    • Minimum duration
    • Maximum duration
    • 95th percentile (p95)
  • Tool Reliability Panel — Success rate for each tool with recent failure indicators
  • Error Categorization — Groups errors into:
    • Tool-only failures (tool errors but execution completed)
    • Execution errors (LLM or system errors)
    • Combined failures (both tool and execution issues)
  • Stale Execution Detection — Alerts for executions in progress >10 minutes
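
The composite score is a weighted sum of the three rates listed above. A minimal sketch, assuming each rate is supplied as a fraction between 0 and 1; the function name is illustrative, and any rounding the dashboard applies is not shown.

```python
def health_score(
    completion_rate: float, tool_success_rate: float, error_free_rate: float
) -> float:
    """Return a 0-100 score using the 50/30/20 weighting described above."""
    for rate in (completion_rate, tool_success_rate, error_free_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be fractions between 0 and 1")
    return 100 * (
        0.5 * completion_rate + 0.3 * tool_success_rate + 0.2 * error_free_rate
    )


# Perfect completion and tool success, errors in 10% of executions.
mostly_healthy = health_score(1.0, 1.0, 0.9)
# Weak tool success drags the score down even with 90% completion.
tool_trouble = health_score(0.9, 0.5, 0.6)
```

Because completion rate carries half the weight, a drop in tool reliability alone can cost at most 30 points.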

What to look for:

  • Health scores below 80 (indicates systemic issues)
  • Tool reliability below 90% (specific tool may need debugging)
  • Stale executions (may indicate timeout or hang conditions)
  • P95 duration significantly higher than average (outlier performance issues)
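
The stale-execution rule can be expressed as a simple age check. This is a sketch under assumptions: the function and parameter names are hypothetical, and only the `in_progress` status and the 10-minute threshold come from this page.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=10)  # threshold stated in this section


def is_stale(status: str, started_at: datetime, now: datetime) -> bool:
    """An execution is stale if it is still in progress after more than 10 minutes."""
    return status == "in_progress" and (now - started_at) > STALE_AFTER


now = datetime(2025, 1, 7, 12, 0, tzinfo=timezone.utc)
hung = is_stale("in_progress", now - timedelta(minutes=11), now)
recent = is_stale("in_progress", now - timedelta(minutes=9), now)
finished = is_stale("completed", now - timedelta(minutes=30), now)
```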

Example insights:

  • "Health score dropped from 95 to 75 over the past week" → Recent configuration change or external service degradation
  • "search_knowledge_base tool has 60% success rate" → Check Azure AI Search service health or index configuration

4. LLM Model & Tool Analytics

Understand which models and tools are used most, their costs, and efficiency metrics.

Components:

  • Model Distribution Chart — Pie chart showing proportion of calls per model
  • Model Cost & Token Breakdown Table:
    • Model name
    • LLM calls count
    • Total tokens (including cache read and write tokens)
    • Cache read tokens - tokens served from provider cache at a discounted rate
    • Cache write tokens - tokens written to provider cache (Anthropic only)
    • Total cost
    • Token efficiency ratio (request-to-response ratio)
  • Tool Usage Frequency Bar Chart — Tools ranked by usage with category-based coloring
  • Agent Tool Profiles — Which agents use which tools
  • Tool Name Normalization — Dynamic tool IDs (like GKI and Team Messaging tools) are normalized for cleaner reporting

What to look for:

  • Token efficiency ratios <0.5 (very verbose responses, may be inefficient)
  • Underutilized tools (available but rarely called)
  • Model diversity (are you using the right model for each task?)
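
As a rough illustration of the token efficiency ratio, the sketch below divides request tokens by response tokens and flags values under 0.5. Whether cache tokens are counted as request tokens is not stated on this page, so they are left out here; the function names are hypothetical.

```python
from typing import Optional

LOW_RATIO = 0.5  # threshold from the list above


def token_efficiency_ratio(request_tokens: int, response_tokens: int) -> Optional[float]:
    """Request-to-response ratio; None when there are no response tokens."""
    if response_tokens <= 0:
        return None
    return request_tokens / response_tokens


def is_verbose(request_tokens: int, response_tokens: int) -> bool:
    """True when responses are more than twice the size of requests."""
    ratio = token_efficiency_ratio(request_tokens, response_tokens)
    return ratio is not None and ratio < LOW_RATIO


ratio = token_efficiency_ratio(400, 1000)
```

A ratio of 0.4 means the model produced 2.5 response tokens for every request token.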

Example insights:

  • "Claude Opus 4 is 80% of calls but only 20% of executions" → Certain agents make many iterative LLM calls
  • "search_emails tool has 500 calls but search_knowledge_base only 50" → Email search may be overused vs knowledge sources

5. Per-Mailbox Usage

Track costs and activity per monitored mailbox for capacity planning and optimization.

Components:

  • Horizontal Bar Chart — Cost per mailbox
  • Mailbox Usage Table:
    • Mailbox address
    • Scans (active scan count)
    • LLM Calls
    • Cost
    • Avg Cost/Active Scan
    • Last Scan timestamp
  • Active vs Idle Scans — Distinguishes between scans that processed emails vs scans with no new messages

Visibility: This section is automatically hidden when no mailbox scan data exists.

What to look for:

  • Mailboxes with high costs but low active scans (may be scanning too frequently for little benefit)
  • Mailboxes with no recent scans (may indicate subscription or configuration issues)
  • Cost-per-active-scan outliers (certain mailboxes may receive complex emails)

Example insights:

  • "support@company.com costs $50/day but sales@company.com costs $5/day with similar scan frequency" → Support emails may be more complex or trigger more tool use
  • "Mailbox hasn't scanned in 3 days" → Check Microsoft Graph subscription status

6. Code Mode Analytics

Evaluate the experimental Code Mode feature's adoption, performance, and token savings.

Components:

  • Adoption Summary:
    • Code Mode executions count
    • Adoption rate percentage
    • Code blocks executed
  • Token Savings Highlight Card — Primary value proposition showing cumulative token reduction
  • Side-by-Side Comparison Table — Code Mode vs Standard execution metrics:
    • Average tokens per execution
    • Average cost per execution
    • Success rate
    • Average duration
  • Error Analysis by Code Mode Error Types:
    • Timeout (code execution exceeded time limit)
    • Circuit breaker (safety mechanism triggered)
    • Max turns (exceeded iteration limit)
    • Exception (code runtime error)
  • Per-Agent Code Mode Adoption Table — Which agents are using Code Mode and their performance

Visibility: This section is automatically hidden until Code Mode is actually used (experimental feature).

What to look for:

  • Token savings >30% (strong indicator Code Mode is beneficial for your use cases)
  • Error rates higher than standard execution (may need code generation improvements)
  • Which agents benefit most from Code Mode (for targeted rollout)
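
One way to read the 30% target is as a relative reduction in average tokens per execution. The sketch below assumes that definition; the dashboard's cumulative savings figure may be computed differently, and the function name is illustrative.

```python
SAVINGS_TARGET = 0.30  # threshold from the list above


def token_savings(standard_avg_tokens: float, code_mode_avg_tokens: float) -> float:
    """Fractional reduction in average tokens per execution versus standard mode."""
    if standard_avg_tokens <= 0:
        raise ValueError("standard_avg_tokens must be positive")
    return 1 - code_mode_avg_tokens / standard_avg_tokens


# Hypothetical averages: 10,000 tokens in standard mode, 5,500 in Code Mode.
savings = token_savings(10_000, 5_500)
meets_target = savings > SAVINGS_TARGET
```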

Example insights:

  • "Code Mode saves 45% tokens on average with 92% success rate" → Consider enabling for more agents
  • "Timeout errors common on Finance Agent Code Mode executions" → Code is too complex or queries are slow

7. Trigger Type Breakdown

Understand costs across different execution triggers to optimize workflows.

Components:

  • Pie Chart — Cost distribution by trigger type
  • Details Table:
    • Trigger type (friendly label)
    • Executions count
    • LLM Calls
    • Tool Calls
    • Cost

Trigger Types:

  • Inbound Email — Email monitoring and processing
  • Chat — Test Chat and agent conversations
  • Scheduled — Timer-based executions
  • CAIOO Wake-Up — AI Chief of Staff daily cycle
  • CAIOO Goal Management — AICOS working on goals
  • CAIOO Project Management — AICOS managing projects
  • CAIOO Conversation — AICOS email interactions
  • Long-Running Resume — Resumption of paused executions
  • And 5 more variants

What to look for:

  • High costs for scheduled tasks (may be running too frequently)
  • CAIOO variants consuming unexpected budget (tune AICOS settings)
  • Trigger types with low execution counts but high costs (expensive edge cases)

Example insights:

  • "Scheduled tasks cost $100/day across only 20 executions" → Review scheduled reminder frequency
  • "CAIOO Project Management is 40% of costs" → AICOS is very active, consider adjusting wake-up frequency or project scope

Classifier (Shadow) Tab

The Classifier (Shadow) tab surfaces analytics for the pre-execution classifier while it runs in shadow mode. In shadow mode the classifier evaluates each execution and records its routing decision, but does not act on it — agents continue to run as configured. This lets you measure accuracy and potential cost impact before enabling live routing.

Shadow Mode

Classifier decisions are observed and recorded but not applied. No agent behavior changes while the classifier is in shadow mode.

KPI Cards

Metric | Description
Classified Executions | Number of executions the classifier evaluated in the selected period
Divergence Rate | Execution-weighted rate at which the classifier's decision differed from the static routing rule
Est. Daily Savings | Projected cost reduction if classifier decisions were applied (also shown as weekly)
Timeout Rate | Proportion of classifier calls that exceeded the 1,000 ms budget (>0.5% is highlighted in red)
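
The two rate metrics can be sketched as simple proportions over recorded shadow decisions. Everything named here is hypothetical (the `ShadowDecision` record, its fields, and the route labels); only the 1,000 ms budget and the 0.5% highlight threshold come from the table above.

```python
from dataclasses import dataclass

BUDGET_MS = 1000.0     # classifier time budget
TIMEOUT_ALERT = 0.005  # 0.5% highlight threshold


@dataclass
class ShadowDecision:
    # Hypothetical record shape for one classified execution.
    classifier_route: str
    static_route: str
    latency_ms: float


def divergence_rate(decisions: list[ShadowDecision]) -> float:
    """Share of executions where the classifier disagreed with the static rule."""
    if not decisions:
        return 0.0
    diverged = sum(d.classifier_route != d.static_route for d in decisions)
    return diverged / len(decisions)


def timeout_rate(decisions: list[ShadowDecision]) -> float:
    """Share of classifier calls that exceeded the time budget."""
    if not decisions:
        return 0.0
    return sum(d.latency_ms > BUDGET_MS for d in decisions) / len(decisions)


sample = [
    ShadowDecision("lean", "full", 120.0),
    ShadowDecision("full", "full", 90.0),
    ShadowDecision("full", "full", 1500.0),
    ShadowDecision("lean", "full", 200.0),
]
needs_attention = timeout_rate(sample) > TIMEOUT_ALERT
```

Counting each execution once is what makes the rate execution-weighted: a trigger type with many executions influences the overall figure more than a rarely used one.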

Classifier Latency

Latency percentiles for the classifier evaluation step:

  • P50 — Median evaluation time
  • P95 — Typical worst-case latency; values above 1,000 ms are highlighted in amber
  • P99 — Tail latency; values above 1,000 ms are highlighted in red

A healthy classifier should keep P95 well under the 1,000 ms budget to avoid impacting execution start time.
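
To make the percentile cards concrete, here is a small sketch using the nearest-rank method. The dashboard's actual percentile method is not documented here (it may interpolate), and the function names are illustrative; the amber and red rules follow the list above.

```python
import math

BUDGET_MS = 1000


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def highlight(name: str, value_ms: float) -> str:
    """Amber for P95 over budget, red for P99 over budget, otherwise none."""
    if value_ms <= BUDGET_MS:
        return "none"
    return {"p95": "amber", "p99": "red"}.get(name, "none")


# 100 hypothetical latency samples: 12 ms, 24 ms, ..., 1200 ms.
latencies = [float(ms) for ms in range(12, 1212, 12)]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
```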

Divergence by Trigger Type

A horizontal bar chart showing divergence rate broken down by execution trigger type (Inbound Email, Chat, Scheduled, CAIOO variants, etc.). Trigger types with the highest divergence appear first.

Use this view to identify which workflows benefit most from classifier-based routing.

Confidence Distribution

Shows what fraction of classifier decisions fell into each confidence band:

Band | Interpretation
< 60% | Low confidence: classifier is uncertain about the routing decision
60% – 80% | Moderate confidence
≥ 80% | High confidence: classifier is certain about the routing decision

A healthy distribution skews toward the ≥ 80% band. A large proportion of low-confidence decisions may indicate the classifier needs additional tuning for your workload.
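
The banding can be sketched as two threshold checks. The band boundaries follow the table above (exactly 60% counts as moderate, exactly 80% as high); the function names and the sample confidences are illustrative.

```python
from collections import Counter

BANDS = ("low", "moderate", "high")


def confidence_band(confidence: float) -> str:
    """Map a confidence in [0, 1] to its band."""
    if confidence < 0.60:
        return "low"
    if confidence < 0.80:
        return "moderate"
    return "high"


def band_distribution(confidences: list[float]) -> dict[str, float]:
    """Fraction of decisions that fall into each band."""
    counts = Counter(confidence_band(c) for c in confidences)
    total = len(confidences)
    return {band: (counts[band] / total if total else 0.0) for band in BANDS}


distribution = band_distribution([0.55, 0.60, 0.79, 0.80, 0.95])
```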

No-Data State

If no classifier telemetry has been recorded for the selected period, the tab displays an informational message. The classifier emits data to ExecutionSummaries once executions carry classifier dimensions; new tenants or tenants that recently enabled the classifier may see this state initially.

Variants Tab

The Variants tab shows observability data for Contextual Cognitive State (CCS) variants - lean, focused prompt packages that the pre-execution classifier loads instead of the full CCS. This tab is tagged Phase 3 and becomes meaningful once the classifier begins selecting non-full variants.

KPI Cards

Metric | Description
Non-Full Loads | Number of times the classifier selected a variant (email, planning, chat, or reactive) instead of the full CCS
Active Variants | Count of variants with at least one load in the period, out of all known variants
Nightly Dream Cost | Total cost for the DreamState nightly generation cycles that produce updated variant CCS files
Cost Guardrail | Current status of the BR-027 cost guardrail (Healthy or Degraded)

Per-Variant Hit Rate

A horizontal bar chart ranking each variant by load count, with its hit rate shown as a percentage of all non-full variant loads. A hit is recorded each time the classifier successfully reads a variant's CCS file from blob storage. Fallbacks to the full CCS are not counted.

Variants that are registered but have not been loaded in the selected period still appear in a Known Variants chip list in the no-data state.

Unknown variants (loaded but not in the registered list) are flagged with an unknown badge.
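
The ranking and the unknown badge can be sketched as follows. The set of known variants is taken from the variant names mentioned in the KPI card description and may not match the actual registry; the function name and the `legacy` variant are hypothetical.

```python
# Assumption: the registered list equals the variant names mentioned on this page.
KNOWN_VARIANTS = {"email", "planning", "chat", "reactive"}


def hit_rates(load_counts: dict[str, int]) -> list[tuple[str, float, bool]]:
    """Rank variants by load count.

    Each row is (variant, share of all non-full loads, is_unknown).
    """
    total = sum(load_counts.values())
    ranked = sorted(load_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, count / total if total else 0.0, name not in KNOWN_VARIANTS)
        for name, count in ranked
    ]


rows = hit_rates({"chat": 3, "email": 6, "legacy": 1})
```

Here `email` ranks first with 60% of non-full loads, and `legacy` is flagged because it is not in the known list.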

Per-Variant Token and Bloat Metrics

Three metric cards display token injection efficiency and prompt-bloat reduction:

Metric | Description
Avg Tokens Injected | Average tokens added to the prompt when this variant loads (pending persistence)
Fallback Rate | Share of variant load attempts that fell back to the full CCS (pending persistence)
Prompt Bloat vs 1.10 Baseline | Percentage reduction in prompt size compared to the Story 1.10 baseline (baseline pending)

These metrics show dash placeholders until the underlying data points are persisted to ExecutionSummaries.

Nightly Generation Cost (Per Agent)

A table listing each agent that completed at least one DreamState cycle in the period:

Column | Description
Agent | Friendly name and internal agent ID
Cycles | Number of nightly generation cycles completed
Total Cost | Cumulative LLM cost for variant generation
Avg / Cycle | Average cost per generation cycle

Use this table to identify which agents incur the most nightly overhead and to detect unexpectedly high generation costs.

Cost Guardrail (BR-027)

The guardrail monitors whether the Stage 2 nightly generation cost exceeds 1.20 times the Stage 1 baseline on consecutive nights.

Healthy state: All variants are generating and the nightly cost is within the 1.20x threshold. A recent-breach count and timestamp are shown if the guardrail was tripped in the past but has since recovered.

Degraded state: The guardrail has tripped. The panel shows:

  • How many consecutive nights the threshold was breached
  • When the guardrail tripped
  • Which variants are still generating in degraded mode (limited subset)

Degraded mode ends only when a platform administrator clears it manually.
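
The trip condition can be sketched as a streak count over nightly costs. This page does not say how many consecutive nights are required, so `required_nights` is a hypothetical parameter with an arbitrary default; the function names are also illustrative. The sketch covers only the trip check, not the manual clear.

```python
THRESHOLD_MULTIPLIER = 1.20  # from the guardrail description


def trailing_breach_streak(nightly_costs: list[float], stage1_baseline: float) -> int:
    """Count the most recent consecutive nights whose cost exceeded the limit."""
    limit = THRESHOLD_MULTIPLIER * stage1_baseline
    streak = 0
    for cost in reversed(nightly_costs):  # walk back from the latest night
        if cost <= limit:
            break
        streak += 1
    return streak


def should_trip(
    nightly_costs: list[float], stage1_baseline: float, required_nights: int = 2
) -> bool:
    # required_nights is an assumption; the real value is not documented here.
    return trailing_breach_streak(nightly_costs, stage1_baseline) >= required_nights


# Baseline of 10.0 gives a limit of 12.0 per night.
two_in_a_row = should_trip([9.0, 13.0, 12.5], 10.0)
interrupted = should_trip([13.0, 9.0, 12.5], 10.0)
```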

No-Data State

If no variant loads or dream cycles are recorded for the selected period, the tab shows the list of known variants as chips and explains that variants populate as the classifier selects non-full lenses and the nightly DreamState cycle generates them. Try a longer period or confirm with your platform administrator that Phase 3 is active.

How Data is Collected

The Usage Dashboard aggregates data from the ExecutionSummaries table in Azure Table Storage, which captures:

  • Every agent execution (email, chat, scheduled)
  • All mailbox scan operations
  • AICOS activity across all modes
  • LLM model calls, token usage (including cache read and write tokens), and costs
  • Tool invocations and their results

Nine dedicated API endpoints serve different analytical views, including a dedicated /api/console/usage-stats/classifier-shadow endpoint for the Classifier tab. The frontend renders interactive charts using the Recharts library with a unified period selector and responsive layouts optimized for both desktop and mobile.

All data is tenant-scoped and loaded in parallel for fast page rendering. No personally identifiable information (PII) from email content is stored in analytics tables.

Data Volume Limit and Truncation Warning

Each dashboard endpoint scans up to 10,000 execution rows for the selected period. When your tenant has more than 10,000 executions in the chosen window, any affected dashboard tab displays a yellow warning banner:

Showing aggregates over the most recent 10,000 executions. Older rows in this period were not included.

This means aggregated metrics - divergence rate, estimated savings, per-routing-class volumes, costs, and health scores - reflect only the most recent 10,000 executions, not the full period. High-volume tenants (typically 50+ mailboxes at moderate scan frequency) are most likely to encounter this limit on 30-day or 90-day views.

Mitigations:

  • Use a shorter period (7 days) to stay within the limit.
  • Contact your platform administrator if you regularly see this warning - pre-aggregated roll-ups are on the product roadmap for high-volume tenants.
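
The effect of the row limit can be illustrated with a slice over rows sorted newest first. This is a sketch of the behavior described above, not the endpoint's implementation; the function name and the integer stand-ins for rows are hypothetical.

```python
ROW_LIMIT = 10_000  # per-endpoint scan limit described above


def scan_rows(rows_newest_first: list) -> tuple[list, bool]:
    """Keep the most recent rows up to the limit and report whether any were dropped."""
    truncated = len(rows_newest_first) > ROW_LIMIT
    return rows_newest_first[:ROW_LIMIT], truncated


# Hypothetical period with 12,500 executions; index 0 is the newest row.
rows = list(range(12_500))
kept, show_banner = scan_rows(rows)
```

In this example the oldest 2,500 rows fall outside every aggregate, which is why period totals can look lower than expected on long views.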

Best Practices

Regular Monitoring

  • Daily: Check health score and error analysis for immediate issues
  • Weekly: Review per-agent costs and look for optimization opportunities
  • Monthly: Analyze 30- or 90-day trends for capacity planning

Cost Optimization Strategies

  1. High-cost agents: Consider smaller models or prompt optimization
  2. Underused tools: Remove from agents to reduce prompt token overhead
  3. Expensive models for simple tasks: Use GPT-4o-mini or Claude Haiku where appropriate
  4. Frequent mailbox scans with few active scans: Reduce scan frequency

Performance Optimization

  1. Low health scores: Investigate recent configuration changes or external service issues
  2. Tool reliability <90%: Debug specific tools or replace with alternatives
  3. P95 duration outliers: Find and optimize slow executions
  4. Stale executions: Check for timeout issues or long-running processes

Model Selection

Use the Model Analytics section to:

  • Compare token efficiency ratios across models
  • Identify tasks that would benefit from smaller/faster models
  • Understand which agents drive most LLM costs
  • Evaluate if premium models (Opus, GPT-4) are necessary for all use cases

Code Mode Evaluation

If experimenting with Code Mode:

  • Target >30% token savings for meaningful ROI
  • Monitor error rates closely (should be comparable to standard execution)
  • Start with agents that have well-defined, structured tasks
  • Review per-agent adoption to find best-fit use cases

Troubleshooting

Dashboard Shows a Truncation Warning Banner

A yellow banner reading "Showing aggregates over the most recent 10,000 executions" appears on one or more tabs when the selected period contains more than 10,000 executions.

Impact: Metrics on that tab (costs, health scores, divergence rate, estimated savings, per-class volumes) are computed from the most recent 10,000 executions only. The banner is informational - data is still shown, but may understate activity for the oldest portion of the period.

Solution: Switch to a shorter period (7 days) to stay under the limit. If this warning appears consistently and accurate full-period aggregation is important for your workflows, contact your platform administrator.

Dashboard Shows No Data

This can occur when:

  • No agent executions have completed in the selected period
  • The tenant is newly created
  • System clock issues have skewed execution timestamps (check the timestamps on recent executions)

Solution: Verify agents are configured and triggered at least once. Check Monitor > Activity > Agent Activity for recent executions.

Costs Don't Match Provider Billing

The dashboard calculates costs from:

  • Model pricing configured in Build > AI Configuration > LLM Models
  • Token counts from API responses

These figures do not include provider fees, taxes, or enterprise discounts.

Solution: Use the dashboard for relative cost tracking and trends. Compare with provider billing statements monthly to ensure pricing configuration accuracy.
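
As a rough model of why configured pricing matters, the sketch below multiplies token counts by per-token prices. The `ModelPricing` shape, the per-million-token unit, and the prices are all hypothetical; cache read and write pricing is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class ModelPricing:
    # Hypothetical prices in USD per 1,000,000 tokens.
    request_per_million: float
    response_per_million: float


def llm_call_cost(pricing: ModelPricing, request_tokens: int, response_tokens: int) -> float:
    """Estimated cost of one call from token counts and configured prices."""
    return (
        request_tokens * pricing.request_per_million
        + response_tokens * pricing.response_per_million
    ) / 1_000_000


pricing = ModelPricing(request_per_million=3.0, response_per_million=15.0)
cost = llm_call_cost(pricing, request_tokens=1_000_000, response_tokens=500_000)
```

If a configured price is out of date, every cost on the dashboard shifts by the same proportion, which is why trends stay useful even when absolute figures differ from the provider invoice.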

Health Score Seems Wrong

The composite health score weights completion rate highest (50%), tool success rate next (30%), and error-free rate last (20%). This means:

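  • An agent with 90% completion and 50% tool success earns 60 points from those two components (45 + 15), plus up to 20 from the error-free rate; at a 60% error-free rate the score is 72
  • An agent with 100% completion and 100% tool success, but non-fatal errors in 10% of executions, scores 98 (50 + 30 + 18)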

Solution: Review individual components (status distribution, tool reliability, error categories) for root cause instead of relying solely on the composite score.

Classifier Tab Shows "No Data"

The Classifier (Shadow) tab displays a no-data message when no classifier telemetry has been recorded for the selected period.

Possible causes:

  • The pre-execution classifier is not yet enabled for your tenant
  • The selected period predates classifier activation
  • Executions have not yet carried classifier dimensions to ExecutionSummaries

Solution: Confirm with your platform administrator that the pre-execution classifier is active. Try a longer period (30 or 90 days) to capture historical data. Once at least one execution records classifier dimensions, the tab will show data.

Classifier Timeout Rate Is High

A timeout rate above 0.5% means the classifier is frequently exceeding its 1,000 ms budget, which may add latency to execution starts.

Solution: Contact support. High timeout rates may indicate infrastructure resource pressure or a classifier configuration issue.

Variants Tab Shows No Data

The Variants tab shows a no-data state when no CCS variant loads or DreamState cycles are recorded for the selected period.

Possible causes:

  • The pre-execution classifier is not yet selecting non-full variants (Phase 3 feature)
  • The selected period predates Phase 3 activation
  • No nightly DreamState cycles have run yet for any agent

Solution: Confirm with your platform administrator that Phase 3 is active. Try a longer period (30 or 90 days). Once the classifier picks at least one non-full variant or a DreamState cycle completes, the tab shows data.

Cost Guardrail Shows Degraded

The BR-027 guardrail trips when the Stage 2 nightly DreamState generation cost exceeds 1.20 times the Stage 1 baseline on consecutive nights. In degraded mode only a limited subset of variants is generated.

Solution: Contact your platform administrator. Manual intervention is the only way to clear the degraded state. Until cleared, the Variants tab KPI card shows Degraded in red.

Code Mode Section Not Visible

Code Mode Analytics is automatically hidden until at least one execution uses Code Mode.

Solution: Enable Code Mode on an agent (experimental feature) and trigger at least one execution. The section will appear after data is available.