The TTS Layer in Tavus enables your persona to generate natural-sounding voice responses. You can configure the TTS layer to use a third-party TTS engine provider. If layers.tts is not specified, Tavus defaults to the Cartesia engine.
If you use the default engine, you do not need to specify any parameters within the tts layer.

Configuring the TTS Layer

Define the TTS layer under the layers.tts object. Below are the parameters available:

1. tts_engine

Specifies which supported third-party TTS engine to use.
  • Options: cartesia, elevenlabs.
"tts": {
  "tts_engine": "cartesia"
}

2. api_key

Authenticates requests to your selected third-party TTS provider. You can obtain an API key from the provider you selected in tts_engine (Cartesia or ElevenLabs).
The API key is only required when using private voices.
"tts": {
  "api_key": "your-api-key"
}

3. external_voice_id

Specifies which voice to use with the selected TTS engine. To find supported voice IDs, refer to your provider's documentation (Cartesia or ElevenLabs).
You can use any publicly accessible custom voice from ElevenLabs or Cartesia without the provider’s API key. If the custom voice is private, you still need to use the provider’s API key.
"tts": {
  "external_voice_id": "external-voice-id"
}
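Because api_key is only needed for private voices, a client can leave it out of the tts layer for public ones. The sketch below shows one way to assemble the layer in Python. The helper build_tts_layer is hypothetical (it is not a Tavus SDK function), and the voice ID and key are placeholders.

```python
from typing import Optional

# Engine options listed for tts_engine on this page.
SUPPORTED_ENGINES = {"cartesia", "elevenlabs"}


def build_tts_layer(engine: str, voice_id: str, api_key: Optional[str] = None) -> dict:
    """Return a dict for layers.tts; api_key is included only when provided."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported tts_engine: {engine!r}")
    layer = {"tts_engine": engine, "external_voice_id": voice_id}
    if api_key is not None:
        # Private voices need the provider's API key.
        layer["api_key"] = api_key
    return layer


# Public voice: no provider key in the layer.
public_voice = build_tts_layer("elevenlabs", "external-voice-id")

# Private voice: the provider key is included.
private_voice = build_tts_layer("cartesia", "external-voice-id", api_key="your-api-key")
```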

4. tts_model_name

Specifies the model used by the TTS engine. Refer to your provider's documentation (Cartesia or ElevenLabs) for the available model names.
"tts": {
  "tts_model_name": "sonic-3"
}

5. tts_emotion_control

If set to true, enables emotion control in speech. Defaults to true.
"tts": {
  "tts_emotion_control": true
}

6. voice_settings

Optional object for controlling speed, volume, and similar effects. Which approach you use depends on your TTS engine and model:
| Engine | Model | Approach |
| --- | --- | --- |
| ElevenLabs | All models | voice_settings in persona config |
| Cartesia | sonic-2 | voice_settings in persona config |
| Cartesia | sonic-3 | Either voice_settings (global, set once per conversation) or prompt the LLM in system_prompt to output Cartesia SSML tags for dynamic control. Not both. |
Cartesia sonic-3: If you use voice_settings for speed/volume, those settings apply globally for the whole conversation and you cannot use SSML tags for dynamic, per-phrase control. If you want dynamic control, omit voice_settings and have the LLM output SSML tags instead. See Cartesia volume, speed, and emotion.
ElevenLabs (all models): Set parameters in the voice_settings object:
| Parameter | ElevenLabs |
| --- | --- |
| speed | Range 0.7 to 1.2 (0.7 = slowest, 1.2 = fastest) |
| stability | Range 0.0 to 1.0 (0.0 = variable, 1.0 = stable) |
| similarity_boost | Range 0.0 to 1.0 (0.0 = creative, 1.0 = original) |
| style | Range 0.0 to 1.0 (0.0 = neutral, 1.0 = exaggerated) |
| use_speaker_boost | Boolean (enhances speaker similarity) |
See ElevenLabs Voice Settings for details.
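The ranges in the ElevenLabs parameter table above can be checked before a persona is submitted. The following is a minimal sketch, not part of Tavus or ElevenLabs: the function name validate_elevenlabs_voice_settings is hypothetical, and treating the range endpoints as inclusive is an assumption.

```python
# Ranges copied from the ElevenLabs parameter table: (min, max).
# Assumption: both endpoints are treated as inclusive.
ELEVENLABS_RANGES = {
    "speed": (0.7, 1.2),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
}


def validate_elevenlabs_voice_settings(settings: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name, value in settings.items():
        if name == "use_speaker_boost":
            if not isinstance(value, bool):
                problems.append("use_speaker_boost must be a boolean")
        elif name in ELEVENLABS_RANGES:
            low, high = ELEVENLABS_RANGES[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{name} must be a number")
            elif not low <= value <= high:
                problems.append(f"{name}={value} is outside {low} to {high}")
        else:
            problems.append(f"unknown parameter: {name}")
    return problems


ok = validate_elevenlabs_voice_settings(
    {"speed": 0.9, "stability": 0.5, "use_speaker_boost": True}
)
too_fast = validate_elevenlabs_voice_settings({"speed": 1.5})
```

Here ok is an empty list, while too_fast contains one message reporting that speed is out of range.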
Cartesia sonic-2: Use the voice_settings object (e.g. speed, emotion). SSML tags are not used for sonic-2.
Cartesia sonic-3: You can use either of these, but not both:
  • voice_settings: Tavus accepts speed and volume parameters for sonic-3. They apply globally and are set once per conversation. Use this when you want a single default speed and volume for the entire conversation. Using voice_settings prevents dynamic SSML control.
  • SSML in LLM output: Omit voice_settings for speed/volume and instead add instructions to your system_prompt so the LLM outputs Cartesia SSML tags in its responses. This gives you dynamic, per-phrase control. See Cartesia volume, speed, and emotion.
Emotion control is separate; see Emotion Control with Phoenix-4.
Example: system prompt for Cartesia sonic-3 (dynamic speed and volume)
If you are not using voice_settings for sonic-3, add instructions like this to your system_prompt so the LLM outputs Cartesia SSML tags:
When you want to emphasize a word or phrase, use Cartesia SSML tags for speed and volume:
- To slow down: <speed level="0.8">phrase</speed>
- To speed up: <speed level="1.2">phrase</speed>
- To speak louder: <volume level="1.2">phrase</volume>
- To speak more quietly: <volume level="0.8">phrase</volume>
You can combine tags, e.g. <speed level="0.9"><volume level="1.1">important point</volume></speed>.
Only use these tags when it improves clarity or emphasis; keep most of your response in plain text.
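To show where such instructions end up in a persona, here is a minimal Python sketch that appends a shortened version of the prompt text to a base system_prompt and leaves voice_settings out of the tts layer. The helper with_dynamic_ssml and the constant SSML_INSTRUCTIONS are hypothetical names used only for illustration.

```python
import json

# Shortened version of the example instructions above.
SSML_INSTRUCTIONS = (
    "When you want to emphasize a word or phrase, use Cartesia SSML tags "
    "for speed and volume, for example "
    '<speed level="0.8">phrase</speed> or <volume level="1.2">phrase</volume>. '
    "Only use these tags when it improves clarity or emphasis; "
    "keep most of your response in plain text."
)


def with_dynamic_ssml(base_prompt: str) -> dict:
    """Build a partial sonic-3 persona that relies on SSML tags, not voice_settings."""
    return {
        "system_prompt": f"{base_prompt}\n\n{SSML_INSTRUCTIONS}",
        "layers": {
            "tts": {
                "tts_engine": "cartesia",
                "tts_model_name": "sonic-3",
                # voice_settings is deliberately omitted so per-phrase SSML control applies.
            }
        },
    }


persona = with_dynamic_ssml("You are a friendly and informative video host.")
print(json.dumps(persona, indent=2))
```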
Example: voice_settings (ElevenLabs, Cartesia sonic-2, or Cartesia sonic-3 global)
"tts": {
  "voice_settings": {
    "speed": 0.9
  }
}
For sonic-3, this sets global speed once per conversation; for sonic-2 and ElevenLabs, it applies as configured.
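Since sonic-3 accepts either global voice_settings or SSML tags but not both, a client-side check can catch a persona that mixes the two. This is a rough sketch under stated assumptions: sonic3_conflict is a hypothetical helper, and detecting SSML usage by searching system_prompt for speed or volume tags is a heuristic, not how Tavus itself decides.

```python
import re

# Heuristic: an opening <speed ...> or <volume ...> tag in the prompt
# suggests the LLM is being asked to emit Cartesia SSML tags.
SSML_TAG = re.compile(r"<(speed|volume)\b")


def sonic3_conflict(persona: dict) -> bool:
    """Return True if a sonic-3 persona sets global speed/volume and also asks for SSML."""
    tts = persona.get("layers", {}).get("tts", {})
    if tts.get("tts_model_name") != "sonic-3":
        return False
    settings = tts.get("voice_settings") or {}
    has_global = any(key in settings for key in ("speed", "volume"))
    asks_for_ssml = bool(SSML_TAG.search(persona.get("system_prompt", "")))
    return has_global and asks_for_ssml


mixed = {
    "system_prompt": 'To slow down, use <speed level="0.8">phrase</speed>.',
    "layers": {"tts": {"tts_model_name": "sonic-3", "voice_settings": {"speed": 0.9}}},
}
global_only = {
    "system_prompt": "You are a friendly and informative video host.",
    "layers": {"tts": {"tts_model_name": "sonic-3", "voice_settings": {"speed": 0.9}}},
}
```

With these inputs, sonic3_conflict(mixed) returns True and sonic3_conflict(global_only) returns False.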

Example Configuration

Below is an example persona with a fully configured TTS layer:
{
  "persona_name": "AI Presenter",
  "system_prompt": "You are a friendly and informative video host.",
  "pipeline_mode": "full",
  "context": "You're delivering updates in a conversational tone.",
  "default_replica_id": "r665388ec672",
  "layers": {
    "tts": {
      "tts_engine": "cartesia",
      "api_key": "your-api-key",
      "external_voice_id": "external-voice-id",
      "tts_emotion_control": true,
      "tts_model_name": "sonic-3"
    }
  }
}
Refer to the Create Persona API for a complete list of supported fields.
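As a rough illustration of submitting this configuration, the sketch below builds (but does not send) an HTTP request with Python's standard library. The endpoint URL and the x-api-key header are assumptions not stated on this page; confirm both against the Create Persona API reference. The helper build_create_persona_request is a hypothetical name.

```python
import json
import urllib.request

# Assumptions, not taken from this page: the endpoint URL and the x-api-key header.
TAVUS_PERSONAS_URL = "https://tavusapi.com/v2/personas"


def build_create_persona_request(persona: dict, tavus_api_key: str) -> urllib.request.Request:
    """Build a POST request carrying the persona as a JSON body."""
    body = json.dumps(persona).encode("utf-8")
    return urllib.request.Request(
        TAVUS_PERSONAS_URL,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": tavus_api_key},
        method="POST",
    )


persona = {
    "persona_name": "AI Presenter",
    "system_prompt": "You are a friendly and informative video host.",
    "pipeline_mode": "full",
    "layers": {
        "tts": {
            "tts_engine": "cartesia",
            "external_voice_id": "external-voice-id",
            "tts_model_name": "sonic-3",
        }
    },
}

request = build_create_persona_request(persona, "your-tavus-api-key")
# To send it (requires network access and a valid key):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Note that the Tavus key passed here is distinct from the provider api_key inside layers.tts, which authenticates against Cartesia or ElevenLabs.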