The STT Layer in Tavus empowers your persona to transcribe and comprehend spoken input in real time. By default, the STT layer in Tavus leverages smart_turn_detection, powered by Sparrow, for dynamic and responsive conversation flow with intelligent turn-taking.

Configuring the STT Layer

Define the STT layer under the layers.stt object. Below are the parameters available:

1. stt_engine

Specifies the speech-to-text engine used for transcription.

  • Options:

    • tavus-advanced (default) – Offers high-accuracy multilingual transcription.
    • tavus-turbo – Provides faster response times with slightly reduced accuracy.
"layers": {
  "stt": {
    "stt_engine": "tavus-advanced"
  }
}

2. participant_pause_sensitivity

Controls how long the participant can pause before the replica responds. This setting helps you fine-tune the pacing of the conversation.

  • Options:

    • high: The replica replies quickly after short pauses. Good for fast and casual conversations.
    • medium (default): Balanced timing. Allows natural pauses without feeling rushed or delayed.
    • low: The replica waits a bit longer before replying. Useful for slower or more thoughtful discussions.
    • verylow: The replica allows even longer pauses before responding.
    • superlow: The replica has the longest response delay, making it suitable for conversations where participants often pause.
"participant_pause_sensitivity": "medium"

3. participant_interrupt_sensitivity

Controls how easily the participant can interrupt the replica while it is talking. This setting helps adjust how the replica handles overlap in conversation.

  • Options:

    • high: The replica stops speaking immediately when the participant starts talking. Ideal for quick and back-and-forth exchanges.
    • medium (default): Balanced behavior. Allows short interruptions without breaking the flow.
    • low: The participant needs to speak more clearly or for a bit longer to interrupt.
    • verylow: The replica usually keeps talking unless the interruption is strong.
    • superlow: The replica rarely stops mid-sentence. It will usually finish speaking before responding.
"participant_interrupt_sensitivity": "medium"

4. hotwords

Use this to prioritize certain names or terms that are difficult to transcribe.

This field is only available for tavus-advanced engine.

"hotwords": "Roey is the name of the person you're speaking with."

The above query helps the model transcribe “Roey” correctly instead of “Rowie.”

Use hotwords for proper nouns, brand names, or domain-specific language that standard STT engines might struggle with.

5. Turn-taking model

Enables dynamic turn-taking using the Sparrow model, which dynamically adjusts the timeout based on what the users say. It sets a longer timeout when the user is likely not done speaking, and a shorter timeout when the user is likely done speaking.

"smart_turn_detection": true

How Turn-taking Works

  • smart_turn_detection is only supported by the tavus-advanced engine.
  • Disabling smart_turn_detection turns off Sparrow and uses a fixed response delay based on participant_pause_sensitivity.

Example Configuration

Below is an example persona with a fully configured STT layer:

{
  "persona_name": "Customer Service Agent",
  "system_prompt": "You assist users by listening carefully and providing helpful answers.",
  "pipeline_mode": "full",
  "context": "You're handling voice-based customer support inquiries.",
  "default_replica_id": "rfe12d8b9597",
  "layers": {
    "stt": {
      "stt_engine": "tavus-advanced",
      "participant_pause_sensitivity": "medium",
      "participant_interrupt_sensitivity": "low",
      "hotwords": "support",
      "smart_turn_detection": true
    }
  }
}

Refer to the Create Persona API for a complete list of supported fields.