
Safety Gateway Overview

The Safety Gateway is an LLM-powered security layer that analyzes all outbound communications from AI agents before they are sent. It helps prevent sensitive information leakage, inappropriate responses, and other risks associated with automated communications such as email and SMS.

Purpose

When AI agents process emails and generate responses, there's always a risk that:

  • Sensitive data (credit cards, SSNs, bank accounts) could be accidentally included
  • Confidential business information could be shared with external parties
  • Responses could be inappropriate for the recipient or context
  • PII (Personally Identifiable Information) could leak to unauthorized recipients

The Safety Gateway provides a comprehensive defense against these risks through:

  1. PII Detection - Regex-based scanning for credit cards, SSNs, bank accounts, DOBs, passports, and driver's licenses
  2. LLM Analysis - AI-powered content analysis to detect subtle risks that pattern matching misses
  3. Custom Instructions - Organization-specific policies and guidelines for the safety analysis
  4. Recipient Classification - Automatic identification of internal vs. external recipients
  5. Human Review Queue - Suspicious messages are held for human approval before sending
  6. Analytics Dashboard - Interactive charts showing decision trends, top flags, and classification breakdowns

How It Works

The Safety Gateway intercepts outbound communications at the tool execution layer:

  1. Agent generates response - The AI agent decides to send an email, SMS, or other communication
  2. Tool execution intercepted - Before reply_email, send_sms, etc. execute, the Safety Gateway analyzes the content
  3. Safety analysis performed:
    • PII patterns scanned (credit cards, SSNs, etc.)
    • LLM analyzes content for sensitive information
    • Recipients classified as internal or external
    • Risk score calculated (0.0 - 1.0)
  4. Decision made:
    • Allow - Low risk, tool executes immediately
    • Hold - Medium risk, queued for human review
    • Deny - High risk, tool execution blocked

        Agent Generates Response
                   |
                   v
            Tool Execution
          (reply_email, etc.)
                   |
                   v
        +---------------------+
        |   Safety Gateway    |
        |                     |
        |  * PII Scanning     |
        |  * LLM Analysis     |
        |  * Risk Scoring     |
        +---------------------+
                   |
                   v
             Risk Decision
              /         \
             v           v
          Allow       Hold/Deny
            |            |
            v            v
         Execute     Queue for
          Tool        Review
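
The interception flow can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual implementation: `guarded_execute`, the `analyze` callable, and the list used as a review queue are all hypothetical names. Only the tool names come from the Intercepted Tools list in this document.

```python
from typing import Any, Callable, Dict, List, Tuple

# Tool names taken from the "Intercepted Tools" list in this document.
INTERCEPTED_TOOLS = {
    "reply_email", "reply_all_email", "forward_email", "create_draft_reply",
    "send_sms", "notify_user", "schedule_meeting", "email_manager_and_wait",
}


def guarded_execute(
    tool_name: str,
    payload: Dict[str, Any],
    tool_fn: Callable[[Dict[str, Any]], Any],
    analyze: Callable[[Dict[str, Any]], str],
    review_queue: List[Tuple[str, Dict[str, Any]]],
) -> Dict[str, Any]:
    """Run a tool call through a safety check before executing it.

    `analyze` is a hypothetical callable that returns "allow", "hold",
    or "deny" for the given payload.
    """
    # Tools that are not outbound communications run without analysis.
    if tool_name not in INTERCEPTED_TOOLS:
        return {"status": "executed", "result": tool_fn(payload)}

    decision = analyze(payload)
    if decision == "allow":
        return {"status": "executed", "result": tool_fn(payload)}
    if decision == "hold":
        # Held messages wait for a human reviewer instead of being sent.
        review_queue.append((tool_name, payload))
        return {"status": "held"}
    # Anything else is treated as a denial: the tool never runs.
    return {"status": "denied"}
```

Treating any unrecognized decision as a denial is a design choice in this sketch: it keeps the wrapper closed by default if the analysis step returns something unexpected.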

Decision Types

The Safety Gateway makes one of four decisions for each outbound communication:

Decision        | Action                         | When Used
Allow           | Message is sent immediately    | Low risk score, no PII detected, internal recipients
Hold            | Queued for human review        | Medium risk score, uncertain content, external recipients
Deny            | Message is blocked             | High risk score, clear policy violation
Allow (Dry Run) | Message sent, logged for review | Gateway in observation mode

Risk Scoring

Each message receives a danger score from 0.0 to 1.0:

  • 0.0 - 0.3: Low risk - typically auto-approved
  • 0.3 - 0.6: Medium risk - may require review depending on recipient type
  • 0.6 - 0.8: High risk - typically held for review
  • 0.8 - 1.0: Critical risk - may be auto-denied

Risk thresholds are configurable separately for:

  • Internal recipients (within your organization)
  • External recipients (outside your organization)
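
As an illustration of separate thresholds per recipient type, the sketch below maps a score to a decision. The `Thresholds` class, the `decide` function, and the numeric values are assumptions chosen to echo the score bands above. They are not the product's defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    hold_at: float  # scores at or above this value are held for review
    deny_at: float  # scores at or above this value are denied


# Hypothetical values: external recipients get the stricter (lower) thresholds.
INTERNAL = Thresholds(hold_at=0.6, deny_at=0.9)
EXTERNAL = Thresholds(hold_at=0.3, deny_at=0.8)


def decide(score: float, is_external: bool) -> str:
    """Map a 0.0-1.0 score to "allow", "hold", or "deny"."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    limits = EXTERNAL if is_external else INTERNAL
    if score >= limits.deny_at:
        return "deny"
    if score >= limits.hold_at:
        return "hold"
    return "allow"
```

With these example values, a score of 0.4 is allowed for an internal recipient but held for an external one.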

Gateway Modes

The Safety Gateway operates in one of three modes:

Disabled

  • Safety Gateway is completely bypassed
  • All outbound communications execute immediately
  • Use only for testing or when confident in agent behavior

Dry Run (Observation)

  • All communications are allowed through
  • Every message is analyzed and logged
  • Helps tune thresholds before enabling enforcement
  • Shows "simulated decision" in audit log

Enforced

  • Full protection enabled
  • High-risk messages are held or denied
  • Human review required for held items
  • Recommended for production use
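
The three modes differ in whether analysis runs and whether its result is enforced. The sketch below shows one way to express that. `GatewayMode`, `resolve_decision`, and the `"allow_dry_run"` label are hypothetical names used for illustration only.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class GatewayMode(Enum):
    DISABLED = "disabled"
    DRY_RUN = "dry_run"
    ENFORCED = "enforced"


def resolve_decision(
    mode: GatewayMode, analyze: Callable[[], str]
) -> Tuple[str, Optional[str]]:
    """Return (effective_decision, simulated_decision).

    `analyze` is a hypothetical callable returning "allow", "hold", or "deny".
    """
    if mode is GatewayMode.DISABLED:
        # Gateway bypassed: no analysis is run at all.
        return "allow", None
    analyzed = analyze()
    if mode is GatewayMode.DRY_RUN:
        # Message goes through; the analyzed result is kept only for the log.
        return "allow_dry_run", analyzed
    # Enforced: the analyzed decision is the one that applies.
    return analyzed, None
```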

Pages

The Safety Gateway section of the Admin Console includes three pages, all accessible under Build > Governance > Safety Gateway:

  1. Safety Settings - Configure thresholds, modes, custom instructions, and LLM analysis options
  2. Review Queue - Approve, deny, or edit held messages with re-scan validation
  3. Audit Log - View history of all gateway decisions with analytics dashboard

Best Practices

Initial Setup

  1. Start with Dry Run mode to observe what would be flagged
  2. Review the audit log to understand your traffic patterns
  3. Adjust thresholds based on your risk tolerance
  4. Enable Enforced mode when confident in configuration

Threshold Configuration

  • Conservative (Lower thresholds): More messages held for review, higher overhead but safer
  • Permissive (Higher thresholds): Fewer interruptions, faster processing but higher risk
  • Asymmetric: Lower thresholds for external recipients, higher for internal

Queue Management

  • Review pending items promptly to avoid blocking agent workflows
  • Use bulk actions for similar items
  • Add reviewer notes to build institutional knowledge
  • Consider time-sensitive flags for urgent communications

Technical Details

Intercepted Tools

The Safety Gateway intercepts these outbound communication tools:

  • reply_email - Email replies
  • reply_all_email - Reply-all emails
  • forward_email - Email forwarding
  • create_draft_reply - Draft email creation
  • send_sms - SMS/text messages
  • notify_user - User notifications
  • schedule_meeting - Meeting invitations
  • email_manager_and_wait - Manager escalations

Recipient Classification

The Safety Gateway automatically classifies recipients as internal or external using:

  1. Company Domains - Primary lookup against registered company domains in the Companies table (including alternate domains)
  2. Employee Directory - Secondary lookup if domain not found; requires 2+ employees with matching email domain
  3. Domain Heuristics - Falls back to treating unknown domains as external

To ensure accurate internal classification:

  • Register all company domains in the Companies settings
  • Include alternate domains (e.g., subsidiary domains, legacy domains)
  • Sync your employee directory regularly
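
The lookup order described above can be sketched as below. This is an assumption-laden illustration: `classify_recipient` and its parameters are hypothetical, and the real gateway reads from the Companies table and employee directory rather than in-memory collections.

```python
from collections import Counter
from typing import Iterable


def email_domain(address: str) -> str:
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()


def classify_recipient(
    address: str,
    company_domains: Iterable[str],
    employee_emails: Iterable[str],
    min_employees: int = 2,
) -> str:
    """Classify an address as "internal" or "external"."""
    domain = email_domain(address)

    # 1. Primary lookup: registered company domains (including alternates).
    if domain in {d.strip().lower() for d in company_domains}:
        return "internal"

    # 2. Secondary lookup: enough employees share this email domain.
    counts = Counter(email_domain(e) for e in employee_emails)
    if counts[domain] >= min_employees:
        return "internal"

    # 3. Fallback: unknown domains are treated as external.
    return "external"
```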

PII Types Detected

Type             | Examples
Credit Card      | 4111-1111-1111-1111, 5500 0000 0000 0004
SSN              | 123-45-6789, 123 45 6789
Bank Account     | Routing + account number patterns
Date of Birth    | DOB: 01/15/1990, born on 1990-01-15
Passport         | Passport: 123456789, Passport#AB1234567
Driver's License | DL: D12345678, License: ABC12345
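
To give a sense of what regex-based scanning looks like, here is a minimal sketch covering two of the types above. The patterns and the `find_pii` helper are illustrative assumptions, not the gateway's actual rules. They match the example formats shown in the table and will produce false positives on other 16-digit or 9-digit sequences.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; the gateway's real patterns are not documented here.
PII_PATTERNS = {
    # Four groups of four digits, optionally separated by "-" or a space.
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    # 3-2-4 digit groups separated by "-" or a space.
    "ssn": re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b"),
}


def find_pii(text: str) -> List[Tuple[str, str]]:
    """Return (pii_type, matched_text) pairs found in `text`."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((pii_type, match.group(0)))
    return hits
```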

PII Masking

When PII masking is enabled (default), detected PII is automatically masked before sending:

  • Credit cards: ****-****-****-1234 (last 4 digits preserved)
  • SSNs: ***-**-6789 (last 4 digits preserved)
  • Bank accounts: Masked with partial visibility
  • Other PII: Replaced with [REDACTED]

This adds an extra layer of protection even if a message is approved for sending.
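
A masking step that preserves the last four digits, as in the formats above, might look like the following sketch. The patterns and the `mask_pii` function are assumptions for illustration and cover only credit cards and SSNs.

```python
import re

# Illustrative patterns; each captures the last four digits in group 1.
CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}(\d{4})\b")
SSN_RE = re.compile(r"\b\d{3}[- ]\d{2}[- ](\d{4})\b")


def mask_pii(text: str) -> str:
    """Mask credit card and SSN matches, keeping the last four digits."""
    text = CARD_RE.sub(lambda m: "****-****-****-" + m.group(1), text)
    text = SSN_RE.sub(lambda m: "***-**-" + m.group(1), text)
    return text
```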

LLM Analysis

The Safety Gateway uses an LLM to analyze message content for risks that pattern matching cannot detect:

  • Confidential business information
  • Inappropriate tone or language
  • Context-inappropriate disclosures
  • Policy violations
  • Potential phishing or social engineering
  • Unusual requests or instructions

The LLM returns:

  • Danger Score (0.0 - 1.0)
  • Analysis Summary explaining the reasoning
  • Flags indicating specific risk types detected

You can configure which model is used for analysis in Safety Settings.
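
If the analysis result is handled as structured data, it could be parsed along these lines. The JSON shape, the key names (`danger_score`, `summary`, `flags`), and the `LlmAnalysis` class are assumptions for this sketch. The document does not specify the actual response format.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class LlmAnalysis:
    danger_score: float  # 0.0 - 1.0
    summary: str         # explanation of the reasoning
    flags: List[str] = field(default_factory=list)  # risk types detected


def parse_analysis(raw: str) -> LlmAnalysis:
    """Parse a hypothetical JSON analysis result into a typed object."""
    data = json.loads(raw)
    score = float(data["danger_score"])
    # Clamp out-of-range scores so downstream threshold checks stay valid.
    score = min(1.0, max(0.0, score))
    return LlmAnalysis(
        danger_score=score,
        summary=str(data.get("summary", "")),
        flags=[str(f) for f in data.get("flags", [])],
    )
```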

Storage

  • Review Queue: SQL Server (SafetyReviewQueue table)
  • Audit Log: Azure Table Storage (high-volume, cost-effective)
  • Settings: SQL Server (TenantSafetySettings table)

Emergency Override

In urgent situations, administrators can activate an Emergency Override to temporarily disable the Safety Gateway:

  • Requires safety:emergency:override permission
  • Must provide a reason (minimum 20 characters)
  • Duration: 15-240 minutes
  • Automatically reverts to previous mode when expired
  • All overrides are logged for compliance
  • Rate-limited to 3 activations per hour per user
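
The override constraints listed above can be expressed as a small validation routine. This is a sketch under stated assumptions: the `OverrideGuard` class, its in-memory rate limiter, and the choice of exception types are hypothetical. Only the permission name and the numeric limits come from this document.

```python
import time
from collections import defaultdict, deque
from typing import Optional, Set

OVERRIDE_PERMISSION = "safety:emergency:override"
MIN_REASON_CHARS = 20
MIN_MINUTES, MAX_MINUTES = 15, 240
MAX_ACTIVATIONS_PER_HOUR = 3


class OverrideGuard:
    """Validates override requests and rate-limits them per user."""

    def __init__(self) -> None:
        # user -> timestamps (seconds) of recent activations
        self._activations = defaultdict(deque)

    def activate(
        self,
        user: str,
        permissions: Set[str],
        reason: str,
        minutes: int,
        now: Optional[float] = None,
    ) -> float:
        """Return the expiry timestamp, or raise if the request is invalid."""
        now = time.time() if now is None else now
        if OVERRIDE_PERMISSION not in permissions:
            raise PermissionError("missing " + OVERRIDE_PERMISSION)
        if len(reason.strip()) < MIN_REASON_CHARS:
            raise ValueError("reason must be at least 20 characters")
        if not MIN_MINUTES <= minutes <= MAX_MINUTES:
            raise ValueError("duration must be 15-240 minutes")

        recent = self._activations[user]
        # Drop activations that are an hour old or older.
        while recent and now - recent[0] >= 3600:
            recent.popleft()
        if len(recent) >= MAX_ACTIVATIONS_PER_HOUR:
            raise RuntimeError("rate limit: 3 activations per hour per user")

        recent.append(now)
        return now + minutes * 60
```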