Custom STT Onboarding

Create Persona

To get started, you’ll need to create a Persona that specifies your STT engine and VAD sensitivity. Here’s an example Persona:

{
    "system_prompt": "You are a storyteller. You like telling stories to people of all ages. Reply in brief utterances, and ask prompting questions to the user as you tell your stories to keep them engaged.",
    "context": "Here are some of your favorite stories: Little Red Riding Hood, The Ugly Duckling and The Three Little Pigs",
    "persona_name": "Mert the Storyteller",
    "layers": {
        "llm": {
            "model": "custom_model_here",
            "api_key": "example-api-key",
            "base_url": "open-ai-compatible-llm-http-endpoint",
            "tools": [<your-tools-here>],
            "speculative_inference": true
        },
        "tts": {
            "api_key": "example-api-key",
            "tts_engine": "playht",
            "playht_user_id": "your-playht-user-id",
            "external_voice_id": "professional-voice-clone-id",
            "voice_settings": {}, // can also leave the "voice_settings" attr out if you want to use default settings
            "tts_emotion_control": false
        },
        "perception": {
            "perception_model": "raven-0" // or "basic" for simpler vision capabilities
        },
        "stt": {
            "participant_pause_sensitivity": "high",
            "participant_interrupt_sensitivity": "high",
            "smart_turn_detection": true,
            "stt_engine": "tavus-advanced"
        }
    }
}

<persona created>, id: p234324a
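As a rough sketch of how the Persona above might be submitted, the Python below builds the HTTP request without sending it. The endpoint URL and the x-api-key header name are assumptions that should be checked against the API reference, build_create_persona_request is a hypothetical helper, and the payload is trimmed to the stt layer for brevity.

```python
import json
import urllib.request

# Assumed endpoint; confirm it against the API reference before use.
PERSONAS_URL = "https://tavusapi.com/v2/personas"


def build_create_persona_request(api_key: str, persona: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a Persona."""
    body = json.dumps(persona).encode("utf-8")
    return urllib.request.Request(
        PERSONAS_URL,
        data=body,
        # The "x-api-key" header name is an assumption.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Trimmed version of the example Persona: only the stt layer is shown.
persona = {
    "persona_name": "Mert the Storyteller",
    "system_prompt": "You are a storyteller.",
    "layers": {
        "stt": {
            "stt_engine": "tavus-advanced",
            "participant_pause_sensitivity": "high",
            "participant_interrupt_sensitivity": "high",
            "smart_turn_detection": True,
        }
    },
}

request = build_create_persona_request("example-api-key", persona)
# To send it, pass the request to urllib.request.urlopen() and parse the JSON reply.
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.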

STT Engine

The stt_engine parameter selects the transcription engine. The default is tavus-advanced; you can switch to tavus-turbo for a slight latency improvement. However, tavus-advanced provides much higher transcription accuracy and supports non-English languages, so we highly recommend it for almost all use cases.
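The guidance above can be expressed as a small selection helper. This is only an illustrative sketch: choose_stt_engine and its parameters are hypothetical, and the rule that non-English speech always maps to tavus-advanced is inferred from the statement that tavus-advanced supports non-English languages.

```python
STT_ENGINES = ("tavus-advanced", "tavus-turbo")


def choose_stt_engine(language: str = "en", latency_critical: bool = False) -> str:
    """Return an stt_engine value following the recommendation above."""
    # Assumption: any language tag not starting with "en" counts as non-English.
    if not language.lower().startswith("en"):
        return "tavus-advanced"
    # Only trade accuracy for latency when the caller explicitly asks for it.
    return "tavus-turbo" if latency_critical else "tavus-advanced"


stt_layer = {"stt_engine": choose_stt_engine(language="en-US")}
```

With no arguments the helper returns the default engine, which matches the recommendation to use tavus-advanced for almost all use cases.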

Speech Sensitivity

These parameters control the sensitivity of the Voice Activity Detection (VAD) engine. Both default to medium, and each can be set to high, low, verylow, or superlow depending on your needs. Use the guidelines below to choose the right sensitivity for your use case:

Participant Pause Sensitivity

Controls how long a pause the user can take before the replica responds. You can think of this as the replica’s “pause” tolerance.

high: The replica replies quickly after short pauses. Good for fast and casual conversations.
medium (default): Balanced timing. Allows natural pauses without feeling rushed or delayed.
low: The replica waits a bit longer before replying. Useful for slower or more thoughtful discussions.
verylow: The replica allows even longer pauses before responding.
superlow: The replica has the longest response delay, making it suitable for conversations where participants often pause.

Participant Interrupt Sensitivity

Controls how long the user can speak before the replica is interrupted. You can think of this as the replica’s “interrupt” tolerance.

high: The replica stops speaking immediately when the participant starts talking. Ideal for quick and back-and-forth exchanges.
medium (default): Balanced behavior. Allows short interruptions without breaking the flow.
low: The participant needs to speak more clearly or for a bit longer to interrupt.
verylow: The replica usually keeps talking unless the interruption is strong.
superlow: The replica rarely stops mid-sentence. It will usually finish speaking before responding.
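Since both sensitivity parameters accept the same five values, a client-side check can catch typos before a Persona is submitted. The sketch below is a hypothetical helper (normalize_stt_layer is not part of any SDK) that fills in the documented medium default and rejects unknown levels.

```python
# The five levels listed above, from least to most sensitive.
SENSITIVITY_LEVELS = ("superlow", "verylow", "low", "medium", "high")
SENSITIVITY_KEYS = (
    "participant_pause_sensitivity",
    "participant_interrupt_sensitivity",
)


def normalize_stt_layer(stt: dict) -> dict:
    """Return a copy of the stt layer with defaults filled in and values checked."""
    result = dict(stt)  # shallow copy so the caller's dict is left untouched
    for key in SENSITIVITY_KEYS:
        value = result.setdefault(key, "medium")  # documented default
        if value not in SENSITIVITY_LEVELS:
            raise ValueError(
                f"{key} must be one of {SENSITIVITY_LEVELS}, got {value!r}"
            )
    return result


stt = normalize_stt_layer({"participant_pause_sensitivity": "low"})
```

Here the pause sensitivity stays at low, while the missing interrupt sensitivity is filled in as medium.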

Smart Turn Detection

When enabled, Sparrow-0 evaluates semantic and lexical conversation cues in real time to keep turn-taking natural. It:

Continuously assesses speech patterns and conversation content
Combines heuristic strategies and machine learning to refine turn-taking
Adds only 10 ms of latency overhead, allowing response times as fast as 600 ms when needed

Key Benefits:

Enhanced naturalness: Conversations feel more human-like and fluid.
Low latency: Adds only 10 ms of latency, supporting rapid conversational interactions.
Continuous improvement: Gets smarter and more nuanced over time using adaptive learning.
