
ElevenLabs Audio & Video Integration

Connect ElevenLabs to give your AI agents powerful audio and video capabilities including text-to-speech, transcription, sound effects, music generation, dubbing, voice transformation, and audio cleanup.

Overview

The ElevenLabs Audio & Video Integration provides AI agents with 7 tools for generating and processing audio and video content. Agents can convert text to natural-sounding speech, transcribe recordings with speaker identification, generate sound effects and music from descriptions, dub content into 32+ languages, transform voices, and remove background noise from recordings.

All generated files are stored securely in Azure Blob Storage and returned to agents as time-limited, read-only URLs. The integration connects via a single API key from your ElevenLabs account, and all 7 tools are bundled into an "ElevenLabs Audio & Video" tool group ready to assign to any agent.

Info: Voice cloning is intentionally excluded from this integration. ElevenLabs supports voice cloning through its API, but the capability is not exposed to agents because of the consent and legal risks of cloning a voice without explicit authorization.

Use Cases

  • Voiceover Production - Generate narration for product announcements, training materials, or presentations using natural-sounding AI voices
  • Meeting Transcription - Convert meeting recordings to searchable text with speaker identification and timestamps
  • Content Localization - Dub training videos, marketing content, or customer communications into multiple languages while preserving speaker voice and emotion
  • Sound Design - Create notification sounds, background music, or sound effects for apps and presentations from natural language descriptions
  • Audio Cleanup - Remove background noise from recordings before sharing or archiving
  • Podcast Production - Transcribe episodes, generate intro/outro music, and clean up audio quality

How It Works

Admin connects ElevenLabs    7 tools created                 Agents use tools
via API key                  in tool group                   during execution
         |                              |                               |
         v                              v                               v
+-----------------+          +---------------------+         +---------------------+
| Enter API key   |          | text_to_speech      |         | Generate speech,    |
| from ElevenLabs |   -->    | transcribe          |   -->   | transcribe audio,   |
| account         |          | sound_effects       |         | create SFX/music,   |
|                 |          | generate_music      |         | dub content, clean  |
|                 |          | dub, voice_changer  |         | up recordings       |
|                 |          | audio_isolation     |         |                     |
+-----------------+          +---------------------+         +---------------------+
                                                                        |
                                                                        v
                                                             +---------------------+
                                                             | Files stored in     |
                                                             | Azure Blob Storage  |
                                                             | with SAS URL access |
                                                             +---------------------+

Getting Started

Prerequisites

Before connecting ElevenLabs:

  1. Pro Plus+ Subscription - The ElevenLabs integration requires the custom.elevenlabs feature code on your subscription
  2. ElevenLabs Account - You need an active ElevenLabs account with a paid plan (Creator or higher recommended for sufficient character quota)
  3. ElevenLabs API Key - Generate an API key at elevenlabs.io/app/settings/api-keys
  4. Control Bridge Admin Access - You must be a Control Bridge administrator to configure the integration

Step 1: Connect ElevenLabs

  1. Navigate to Build > Connections > ElevenLabs Audio & Video
  2. Click Connect ElevenLabs
  3. Enter your ElevenLabs API key
  4. The system validates the key against the ElevenLabs API and retrieves your account details
  5. Upon successful validation, 7 tools are created and bundled into an "ElevenLabs Audio & Video" tool group

Tip: You can find your API key in the ElevenLabs dashboard under Profile Settings > API Keys. Keep your API key secure - it provides full access to your ElevenLabs account's capabilities and character quota.

Step 2: Verify the Connection

After connecting:

  1. The connection page shows your account status, subscription tier, and character usage
  2. Click Test Connection to verify the integration is working
  3. The system confirms the API key is valid and refreshes your character count and limit

Step 3: Assign the Tool Group to Agents

ElevenLabs tools are bundled into an "ElevenLabs Audio & Video" tool group automatically created at connection time:

  1. Navigate to Build > AI Agents > Agents
  2. Edit the agent that should have audio/video capabilities
  3. Go to the Tools tab
  4. In the Tool Groups section, enable the ElevenLabs Audio & Video group
  5. Save the agent

All 7 ElevenLabs tools are assigned together as a single unit via the tool group.

Warning: Only assign the ElevenLabs tool group to agents that genuinely need audio/video capabilities. ElevenLabs usage consumes characters from your account quota, and generated content uses Azure Blob Storage. Monitor your character usage on the connection page.

Available Tools

When ElevenLabs is connected, 7 tools are created and grouped under the "ElevenLabs Audio & Video" tool group.

1. Text to Speech (elevenlabs_text_to_speech)

Convert text to natural-sounding speech audio with voice selection and model quality options.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | - | The text to convert to speech (max 5,000 characters) |
| voice | string | No | Rachel | Voice name or ID. Available voices: Rachel, Domi, Bella, Antoni, Elli, Josh, Arnold, Adam, Sam. Use a voice ID for custom voices. |
| model | string | No | multilingual_v2 | TTS model: flash (fastest, ~75ms), turbo (balanced), multilingual_v2 (high quality), v3 (emotionally rich) |
| language | string | No | Auto-detect | ISO 639-1 language code (e.g., en, es, fr) |
| output_format | string | No | mp3 | Audio format: mp3 (universal playback), opus (smaller file size), pcm (raw audio data) |
| stability | number | No | 0.5 | Voice stability (0-1). Lower = more expressive, higher = more consistent |
| similarity | number | No | 0.75 | Voice similarity boost (0-1). Higher = closer to original voice |

Returns a URL to the generated audio file along with the format, voice, model, and character count used.
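As a rough illustration of how the parameters in the table fit together, the sketch below validates and assembles an argument payload for elevenlabs_text_to_speech. The helper name build_tts_arguments and the idea of passing arguments as a dictionary are assumptions made for this example; the allowed values and defaults are taken from the table above.

```python
from typing import Any, Dict, Optional

# Allowed values and limits copied from the parameter table above.
TTS_MODELS = {"flash", "turbo", "multilingual_v2", "v3"}
TTS_FORMATS = {"mp3", "opus", "pcm"}
MAX_TTS_CHARS = 5000


def build_tts_arguments(
    text: str,
    voice: str = "Rachel",
    model: str = "multilingual_v2",
    output_format: str = "mp3",
    stability: float = 0.5,
    similarity: float = 0.75,
    language: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: check inputs against the documented limits."""
    if not text or len(text) > MAX_TTS_CHARS:
        raise ValueError(f"text must be 1-{MAX_TTS_CHARS} characters")
    if model not in TTS_MODELS:
        raise ValueError(f"model must be one of {sorted(TTS_MODELS)}")
    if output_format not in TTS_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(TTS_FORMATS)}")
    for name, value in (("stability", stability), ("similarity", similarity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    arguments: Dict[str, Any] = {
        "text": text,
        "voice": voice,
        "model": model,
        "output_format": output_format,
        "stability": stability,
        "similarity": similarity,
    }
    # Omit language so the tool falls back to auto-detection.
    if language is not None:
        arguments["language"] = language
    return arguments
```

For instance, build_tts_arguments("Hola a todos", language="es") yields a payload with the default voice and model plus language set to es.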

Example queries agents can answer:

  • "Read this announcement aloud" - Converts text to speech with default settings
  • "Create a Spanish voiceover for this script" - Uses language='es' with multilingual model
  • "Generate a quick audio preview" - Uses model='flash' for fastest generation

2. Transcribe (elevenlabs_transcribe)

Convert audio or video files to text transcripts with speaker diarization and timestamp options.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_url | string | Yes | - | URL to the audio/video file (HTTPS). Supports MP3, WAV, M4A, MP4, WebM. Max 50MB. |
| language | string | No | Auto-detect | ISO 639-1 language code for optimization |
| speakers | integer | No | Auto-detect | Expected number of speakers for diarization (1-32) |
| timestamps | string | No | word | Timestamp granularity: none, word, or character |
| format | string | No | text | Output format: text (plain transcript), srt (subtitles), json (structured with timestamps) |

Returns the transcript text, detected language, audio duration, and number of speakers detected. In text format, transcripts longer than 10,000 characters are truncated with a notice; use the json format for full output.

Output formats:

  • text - Plain text transcript, suitable for most use cases
  • srt - SRT subtitle format with timing markers, useful for video captioning
  • json - Structured output with word-level timestamps and speaker labels

3. Sound Effects (elevenlabs_sound_effects)

Generate sound effects from natural language descriptions.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| description | string | Yes | - | Text description of the desired sound (e.g., "thunder rumbling in the distance") |
| duration | number | No | Auto | Duration in seconds (0.5-30). Omit for optimal auto-determined length. |
| loop | boolean | No | false | Create a smoothly looping sound effect |
| prompt_influence | number | No | 0.3 | How closely to follow the description (0-1). Lower = more creative, higher = more literal. |

Returns a URL to the generated audio file.

Example descriptions:

  • "Gentle notification chime"
  • "Thunder rumbling in the distance"
  • "Keyboard typing sounds"
  • "Ocean waves crashing on rocks"

4. Generate Music (elevenlabs_generate_music)

Create background music and compositions from text descriptions.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| description | string | Yes | - | Text description of the desired music (e.g., "upbeat corporate background music with light piano") |
| duration | number | No | 30 | Duration in seconds (5-300) |
| instrumental | boolean | No | true | Instrumental only, no vocals |

Returns a URL to the generated audio file.

Example descriptions:

  • "Upbeat corporate piano with subtle drums"
  • "Calm ambient lo-fi beat for studying"
  • "Dramatic orchestral intro for a presentation"
  • "Relaxing acoustic guitar background"

5. Dub (elevenlabs_dub)

Translate audio and video content into other languages while preserving speaker voice and emotion. This tool uses an asynchronous submit-and-poll pattern because dubbing can take several minutes to complete.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_url | string | For submit | - | URL to the audio/video file (HTTPS). Max 50MB, 2.5 hours. |
| target_language | string | For submit | - | ISO 639-1 language code for the dubbed output (e.g., es, fr, de) |
| source_language | string | No | auto | ISO 639-1 language code of the source audio |
| speakers | integer | No | 0 | Number of speakers (0 for auto-detection) |
| watermark | boolean | No | false | Apply watermark to reduce credit cost |
| dubbing_id | string | For poll | - | ID of an existing dubbing job to check status |

Two-mode operation:

  1. Submit mode - Provide file_url and target_language to start a new dubbing job. The tool returns a dubbing_id and a processing status.
  2. Poll mode - Provide the dubbing_id to check progress. Returns one of three statuses:
    • processing - Job is still running. Try again in 30-60 seconds.
    • complete - Dubbing is finished. A URL to the dubbed file is included.
    • failed - Job encountered an error. The error message is included.

Supported languages include: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Swedish, Norwegian, Danish, Finnish, Turkish, Arabic, Hindi, Japanese, Korean, Chinese (Mandarin), and many more (32+ total).

Tip: When configuring agent instructions for dubbing workflows, remind the agent to poll for completion using the dubbing_id. A typical dubbing job takes 1-5 minutes depending on file length.
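The submit-and-poll pattern described in this section can be sketched roughly as follows. This is an illustration only: call_dub stands in for however your environment invokes elevenlabs_dub, and the response field names (dubbing_id, status, error) are assumptions based on the descriptions above, not a documented schema.

```python
import time
from typing import Any, Callable, Dict


def wait_for_dub(
    call_dub: Callable[[Dict[str, Any]], Dict[str, Any]],
    file_url: str,
    target_language: str,
    poll_interval: float = 30.0,   # the docs suggest retrying every 30-60 seconds
    timeout: float = 900.0,        # give up after 15 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit a dubbing job, then poll until it completes, fails, or times out."""
    # Submit mode: file_url and target_language start a new job.
    job = call_dub({"file_url": file_url, "target_language": target_language})
    dubbing_id = job["dubbing_id"]  # assumed response field name

    waited = 0.0
    while waited <= timeout:
        # Poll mode: only the dubbing_id is sent.
        result = call_dub({"dubbing_id": dubbing_id})
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(f"dubbing failed: {result.get('error', 'unknown error')}")
        sleep(poll_interval)
        waited += poll_interval

    raise TimeoutError(f"dubbing job {dubbing_id} still processing after {timeout} seconds")
```

Injecting the sleep function keeps the loop easy to test and lets a caller substitute its own scheduling instead of blocking.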

6. Voice Changer (elevenlabs_voice_changer)

Transform the voice in an audio recording to a different voice while preserving the speech content.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_url | string | Yes | - | URL to the source audio file (HTTPS). Supports MP3, WAV, M4A. Max 50MB. |
| voice | string | No | Rachel | Target voice name or ID. Available voices: Rachel, Domi, Bella, Antoni, Elli, Josh, Arnold, Adam, Sam. |
| model | string | No | multilingual_v2 | Model: flash or multilingual_v2 |
| stability | number | No | 0.5 | Voice stability (0-1) |
| similarity | number | No | 0.75 | Voice similarity boost (0-1) |

Returns a URL to the transformed audio file.

7. Audio Isolation (elevenlabs_audio_isolation)

Remove background noise from audio recordings, isolating clean speech.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_url | string | Yes | - | URL to the audio file to clean (HTTPS). Supports MP3, WAV, M4A. Max 50MB. |

Returns a URL to the cleaned audio file with background noise removed.

Common use cases:

  • Clean up noisy meeting recordings before transcription
  • Improve audio quality of field recordings
  • Prepare audio for voiceover or podcast production

Security & Limitations

Security

  • API key encrypted at rest - Stored using AES-256-CBC encryption; never exposed in GET API responses
  • Tenant isolation - Each tenant has its own ElevenLabs connection and tools, scoped by TenantId
  • Secure file storage - Generated files stored in tenant-scoped Azure Blob Storage paths
  • Time-limited access - SAS URLs expire after 1 hour with read-only, HTTPS-only permissions
  • Automatic file cleanup - Generated files are automatically deleted after 7 days via Azure lifecycle policy
  • SSRF protection - File URL inputs must use HTTPS, and private IP addresses are blocked
  • Internal users only - ElevenLabs tools are gated to internal users or above; external senders cannot trigger content generation
  • Audit logging - Every tool execution is logged with the agent, parameters, and results
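To make the SSRF protection item above more concrete, here is a minimal sketch of an HTTPS-only check combined with private IP blocking. It is not the integration's actual validation code, which is not documented here; the function name is_safe_file_url and the injectable resolve parameter are assumptions for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_safe_file_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only for HTTPS URLs whose host resolves to public addresses."""
    try:
        parts = urlsplit(url)
        port = parts.port or 443
    except ValueError:
        return False  # malformed URL or invalid port
    if parts.scheme != "https" or not parts.hostname:
        return False

    try:
        infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except OSError:
        return False  # host could not be resolved

    for info in infos:
        # info[4][0] is the address string; drop any IPv6 scope suffix.
        address = ipaddress.ip_address(info[4][0].split("%")[0])
        # is_global is False for private, loopback, and link-local ranges.
        if not address.is_global:
            return False
    return bool(infos)
```

A check like this only covers the moment of validation. A production implementation would typically also pin the resolved address for the actual download so that a later DNS change cannot redirect the request.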

Limitations

  • No voice cloning - Voice cloning is intentionally excluded due to consent and legal considerations
  • Single connection per tenant - Only one ElevenLabs account per Control Bridge tenant
  • File size limit - Input files (for transcription, dubbing, voice changing, audio isolation) are limited to 50MB
  • Text-to-speech cap - Maximum 5,000 characters per text-to-speech call
  • Transcript truncation - Transcription output is truncated at 10,000 characters in text mode (use JSON format for full output)
  • Dubbing length - Dubbing supports files up to 2.5 hours in length
  • Sound effects duration - Sound effects are limited to 0.5-30 seconds
  • Music duration - Music generation supports 5-300 seconds (5 minutes)
  • Character quota - All operations consume characters from your ElevenLabs account quota; monitor usage on the connection page

Rate Limits

ElevenLabs enforces rate limits based on your subscription tier. If an agent hits a rate limit, the tool automatically retries with exponential backoff, up to 3 times. If ElevenLabs returns a Retry-After value longer than 60 seconds, the tool fails immediately rather than stalling the agent.
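The retry behaviour described above could look roughly like the sketch below. Only the three-retry limit, the exponential backoff, and the 60-second Retry-After cutoff come from this page. The RateLimited exception, the 1-second base delay, and the choice to wait for a short Retry-After value instead of the computed backoff are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when ElevenLabs responds with a rate limit."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the server supplied one


def call_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,        # assumed starting delay
    max_retry_after: float = 60.0,  # longer waits fail immediately
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    attempt = 0
    while True:
        try:
            return operation()
        except RateLimited as exc:
            # A long Retry-After would stall the agent, so give up at once.
            if exc.retry_after is not None and exc.retry_after > max_retry_after:
                raise
            if attempt >= max_retries:
                raise
            # Prefer the server's hint; otherwise double the delay each attempt.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
            attempt += 1
```

With these defaults and no Retry-After hint, a persistently rate-limited call waits 1, 2, and 4 seconds before the final failure is raised.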

Troubleshooting

API Key Validation Fails

Problem: The API key is rejected when connecting

Solutions:

  1. Verify you copied the full API key from elevenlabs.io/app/settings/api-keys
  2. Ensure your ElevenLabs account is active and has a paid subscription
  3. Check that the API key has not been revoked or regenerated
  4. Try generating a new API key from the ElevenLabs dashboard

Connection Test Fails After Connecting

Problem: The Test Connection button returns an error

Solutions:

  1. Your API key may have been revoked - go to the ElevenLabs dashboard and verify the key is still active
  2. Click Update API Key to enter a new key if the original was rotated
  3. Check your ElevenLabs subscription status - an expired plan will cause API failures

Agent Cannot Find ElevenLabs Tools

Problem: ElevenLabs tools do not appear when editing an agent

Solutions:

  1. Verify the ElevenLabs connection is active at Build > Connections > ElevenLabs Audio & Video
  2. Check that the "ElevenLabs Audio & Video" tool group exists at Build > AI Agents > Tool Groups
  3. Assign the tool group (not individual tools) to the agent
  4. Refresh the page and try again

Text-to-Speech Returns Error

Problem: Agent receives an error when generating speech

Solutions:

  1. Check that the text is under 5,000 characters - split longer text into multiple calls
  2. Verify the voice name is valid (use one of the premade voices or a valid custom voice ID)
  3. Check your ElevenLabs character quota on the connection page - you may have exhausted your monthly limit
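Splitting longer text into multiple calls, as suggested in the first solution, might be done along these lines. The helper split_for_tts is hypothetical; it prefers sentence boundaries and falls back to a hard cut only when a single sentence exceeds the limit.

```python
import re
from typing import List


def split_for_tts(text: str, limit: int = 5000) -> List[str]:
    """Split text into chunks of at most `limit` characters, on sentence ends where possible."""
    # Break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the text of a separate elevenlabs_text_to_speech call, and the resulting audio files played or joined in order.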

Dubbing Job Stuck in Processing

Problem: A dubbing job remains in "processing" status after multiple polls

Solutions:

  1. Dubbing jobs for longer files can take several minutes - allow up to 10 minutes for large files
  2. If the job does not complete after 15 minutes, it may have failed silently. Submit a new dubbing job.
  3. Verify the source file URL is still accessible and has not expired

File URL Rejected

Problem: Agent receives "file_url is invalid" or a download error

Solutions:

  1. Ensure the file URL uses HTTPS (HTTP URLs are rejected)
  2. Verify the URL is publicly accessible or uses a valid SAS token
  3. Check that the file size is under 50MB
  4. Confirm the file format is supported by the tool being called (MP3, WAV, M4A, MP4, WebM for transcription; MP3, WAV, M4A for voice changer and audio isolation)

Rate Limit Errors

Problem: Agent receives rate limit errors from ElevenLabs

Solutions:

  1. The tool retries automatically with exponential backoff (up to 3 times) - wait a moment and the request should succeed
  2. If errors persist, reduce the frequency of ElevenLabs operations by spacing out agent executions
  3. Consider upgrading your ElevenLabs plan for higher rate limits
  4. Avoid running multiple agents that use ElevenLabs tools simultaneously

Best Practices

Agent Instructions

Help your agents use ElevenLabs tools effectively:

When working with ElevenLabs audio and video tools:
1. For text-to-speech, keep text under 5,000 characters per call. Split
longer content into multiple calls.
2. Choose the right TTS model: 'flash' for quick previews, 'multilingual_v2'
for production quality, 'v3' for emotionally expressive speech.
3. For dubbing, always save the dubbing_id and poll for completion. Dubbing
jobs typically take 1-5 minutes.
4. When transcribing, specify the expected number of speakers if known for
better diarization accuracy.
5. For sound effects, be descriptive and specific (e.g., 'soft rain on a
window' rather than just 'rain').
6. Consider running audio_isolation before transcription if the source
recording has significant background noise.

Configuration

  • Assign the ElevenLabs tool group only to agents that genuinely need audio/video capabilities
  • Monitor character usage regularly on the connection page to avoid unexpected quota exhaustion
  • Consider creating a dedicated "Media Agent" with the ElevenLabs tool group rather than adding it to general-purpose agents
  • Test your agents with sample audio/video tasks after setup to verify they use the tools correctly

Security

  • Rotate your ElevenLabs API key periodically from the ElevenLabs dashboard, then update it on the connection page
  • Review agent execution logs to monitor what audio/video operations are being performed
  • Disconnect the ElevenLabs integration if your organization no longer needs audio/video capabilities
  • Generated files are automatically cleaned up after 7 days, but you can disconnect to stop new file creation immediately