Custom STT Onboarding
You can choose between `tavus-turbo` and `tavus-advanced` as your STT engine, and adjust the VAD sensitivity to your needs.
Create Persona
To get started, you’ll need to create a Persona that specifies your STT engine and VAD sensitivity. Here’s an example Persona:
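The original example isn’t reproduced here, so the following is a minimal sketch of a Create Persona request. It assumes the `POST https://tavusapi.com/v2/personas` endpoint with an `x-api-key` header and an `stt` layer whose field names (`stt_engine`, `participant_pause_sensitivity`, `participant_interrupt_sensitivity`) mirror the parameters described below; check the API reference for the authoritative schema and response shape.

```python
import requests

# Sketch of a Create Persona request with a custom STT configuration.
# Endpoint, header, and field names are assumed; verify against the API reference.
url = "https://tavusapi.com/v2/personas"
headers = {
    "x-api-key": "<your-api-key>",  # replace with your Tavus API key
    "Content-Type": "application/json",
}
payload = {
    "persona_name": "Customer Support Agent",
    "system_prompt": "You are a helpful, concise support agent.",
    "layers": {
        "stt": {
            "stt_engine": "tavus-advanced",                 # or "tavus-turbo"
            "participant_pause_sensitivity": "medium",      # low | medium | high
            "participant_interrupt_sensitivity": "medium",  # low | medium | high
        }
    },
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"persona_id": "p234324a", ...}
```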
Once the Persona is created, it’s assigned an ID (in this example, `p234324a`).
STT Engine
The STT engine parameter controls which transcription engine is used. The default is `tavus-advanced`, but you can switch to `tavus-turbo` for a small latency improvement. However, `tavus-advanced` provides much higher transcription accuracy and supports non-English languages, so we highly recommend it for almost all use cases.
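If latency matters more than accuracy for your use case, the change is a single field in the `stt` layer (field name assumed, matching the sketch above):

```python
# Default: best accuracy, supports non-English languages (recommended).
stt_layer = {"stt_engine": "tavus-advanced"}

# Alternative: trades some accuracy for a small latency improvement.
stt_layer_fast = {"stt_engine": "tavus-turbo"}
```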
Speech Sensitivity
These parameters control the sensitivity of the Voice Activity Detection (VAD) engine. The default for each is `medium`, but you can adjust them to `low` or `high` depending on your needs. Use the guidelines below to choose the right sensitivity for your use case; example configurations follow the lists.
Participant Pause Sensitivity
Controls how long a pause the participant can take before the replica responds. You can think of this as the replica’s “pause” tolerance.
- `low`: The participant can take longer pauses before the replica responds. Use this for slower, more thoughtful conversations.
- `medium`: The default behavior. A nice balance between responsiveness and thoughtful pauses.
- `high`: The replica responds very quickly to the participant’s speech. Use this for fast, chatty conversations, where even small pauses from the participant will trigger a response.
Participant Interrupt Sensitivity
Controls how long the participant can speak before the replica is interrupted and stops talking. You can think of this as the replica’s “interrupt” tolerance.
- `low`: The participant can talk longer before the replica stops talking and listens. Use this for slower, more thoughtful conversations.
- `medium`: The default behavior. A nice balance between responsiveness and tolerance of short affirmations.
- `high`: The replica stops talking very quickly when the participant speaks. Use this for fast, chatty conversations, where even short responses from the participant will trigger a new response from the replica.
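Putting the guidelines together, here are two illustrative `stt` layer configurations (field names assumed, as in the earlier sketch): one tuned for a slow, thoughtful exchange and one for a fast, chatty one.

```python
# Thoughtful conversation: tolerate long pauses and brief interjections
# without the replica jumping in or stopping mid-sentence.
thoughtful_stt_layer = {
    "stt_engine": "tavus-advanced",
    "participant_pause_sensitivity": "low",
    "participant_interrupt_sensitivity": "low",
}

# Chatty conversation: respond quickly to short pauses and yield
# as soon as the participant starts speaking.
chatty_stt_layer = {
    "stt_engine": "tavus-advanced",
    "participant_pause_sensitivity": "high",
    "participant_interrupt_sensitivity": "high",
}
```

Either layer can be dropped into the `layers` object of the Create Persona request shown earlier.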