Speech-to-Text (STT)
Learn how to configure the STT layer to enable smart turn detection and enhance conversational flow.
The STT Layer in Tavus empowers your persona to transcribe and comprehend spoken input in real time. By default, the STT layer in Tavus leverages smart_turn_detection
, powered by Sparrow, for dynamic and responsive conversation flow with intelligent turn-taking.
Configuring the STT Layer
Define the STT layer under the layers.stt
object. Below are the parameters available:
1. stt_engine
Specifies the speech-to-text engine used for transcription.
-
Options:
tavus-advanced
(default) – Offers high-accuracy multilingual transcription.tavus-turbo
– Provides faster response times with slightly reduced accuracy.
2. participant_pause_sensitivity
Controls how long the participant can pause before the replica responds. This setting helps you fine-tune the pacing of the conversation.
-
Options:
high
: The replica replies quickly after short pauses. Good for fast and casual conversations.medium
(default): Balanced timing. Allows natural pauses without feeling rushed or delayed.low
: The replica waits a bit longer before replying. Useful for slower or more thoughtful discussions.verylow
: The replica allows even longer pauses before responding.superlow
: The replica has the longest response delay, making it suitable for conversations where participants often pause.
3. participant_interrupt_sensitivity
Controls how easily the participant can interrupt the replica while it is talking. This setting helps adjust how the replica handles overlap in conversation.
-
Options:
high
: The replica stops speaking immediately when the participant starts talking. Ideal for quick and back-and-forth exchanges.medium
(default): Balanced behavior. Allows short interruptions without breaking the flow.low
: The participant needs to speak more clearly or for a bit longer to interrupt.verylow
: The replica usually keeps talking unless the interruption is strong.superlow
: The replica rarely stops mid-sentence. It will usually finish speaking before responding.
4. hotwords
Use this to prioritize certain names or terms that are difficult to transcribe.
This field is only available for tavus-advanced
engine.
The above query helps the model transcribe “Roey” correctly instead of “Rowie.”
Use hotwords for proper nouns, brand names, or domain-specific language that standard STT engines might struggle with.
5. Turn-taking model
Enables dynamic turn-taking using the Sparrow model, which dynamically adjusts the timeout based on what the users say. It sets a longer timeout when the user is likely not done speaking, and a shorter timeout when the user is likely done speaking.
How Turn-taking Works
smart_turn_detection
is only supported by thetavus-advanced
engine.- Disabling
smart_turn_detection
turns off Sparrow and uses a fixed response delay based onparticipant_pause_sensitivity
.
Example Configuration
Below is an example persona with a fully configured STT layer:
Refer to the Create Persona API for a complete list of supported fields.