Speech to Speech Quickstart

Tavus

Take a look at our Docs and API Reference to learn how to use Tavus!

Introduction

Getting an API Key

Overview of Tavus' Replica offerings: Stock Replicas and Personal Replicas, all powered by the Phoenix AI model. Get tips on how to create the perfect replica and how to get high-quality output.

Overview

Learn how to create a high-quality training video.

Replica Training

Learn how to use our API endpoints to create replicas.

Creating a Replica Via API

Explore Tavus' diverse library of ready-to-use stock replicas for effortless video creation and conversations.

Stock Replicas

Learn how to create a high-quality personal replica with just a few minutes of training data.

Personal Replicas

Language Support

This guide will walk you through the steps to quickly test out the API and start a conversation.

Quick Start

Creating a Conversation

Creating a Persona

Layers and Modes Overview

Stock Personas

Using Replicas in CVI

Turn Taking with Sparrow

Learn how to configure the perception layer with Raven.

Perception with Raven

Frequently asked questions about Tavus's Conversational Video Interface

Learn how to generate high-quality videos using Stock or Personal Replicas

Find out how to look at all the Stock Replicas as well as your Personal Replicas

Replica Selection

Learn how to create a high-quality script

Scripting

Synchronize audio with existing videos using Tavus's lipsync service. Easily create videos where the speaker's mouth movements match the provided audio.

Training Video Size

Consent Statement

Script Length

This guide includes an overview of errors and status details you might see from the API

API Errors and Status Details

This guide includes an overview of different callback formats you might see from the API.

API Callbacks

Changelog

With the Tavus Conversational Video Interface (CVI) you are able to create a `conversation` with a replica in real time.

### Conversations
A `conversation` is a video call with a replica. 

After creating a `conversation`, a `conversation_url` will be returned in the response. The `conversation_url` can be used to join the conversation directly or can be embedded in a website. To embed the `conversation_url` in a website, you can find [instructions here](https://www.daily.co/products/prebuilt-video-call-app/quickstart/).

Once a conversation is created, the replica will automatically join the call and will start participating.

By providing a `callback_url`, you can receive webhooks with updates regarding the conversation state.

[Learn about recording conversations here](/sections/conversational-video-interface/recording-rooms).
<Warning>
- **If your persona does not have a default replica**, the `replica_id` is required.
- **If your persona has a default replica**, the `replica_id` is not required.
- **If your persona has a default replica and you define `replica_id`**, it will override the persona's default replica.
</Warning>
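
The three rules above can be read as a simple precedence order. The sketch below is illustrative only: `resolve_replica_id` is a hypothetical helper, not part of the Tavus API, and it assumes you already know whether the persona has a default replica.

```python
from typing import Optional


def resolve_replica_id(persona_default: Optional[str], requested: Optional[str]) -> str:
    """Return the replica that would join the call under the rules above."""
    if requested:
        # An explicit replica_id overrides the persona's default replica.
        return requested
    if persona_default:
        # No explicit replica_id: fall back to the persona's default replica.
        return persona_default
    # Neither is available, so the request is missing a required field.
    raise ValueError("replica_id is required when the persona has no default replica")
```

Running a check like this before creating a conversation surfaces a missing `replica_id` locally instead of as a failed request.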


Create Conversation

This endpoint returns a single conversation by its unique identifier.


Get Conversation

This endpoint returns a list of all Conversations created by the account associated with the API Key in use.


List Conversations

This endpoint ends a single conversation by its unique identifier.


End Conversation

This endpoint deletes a single conversation by its unique identifier.


Delete Conversation

Create and customize a digital replica's personality for Conversational Video Interface (CVI). A persona defines the replica's behavior and capabilities through configurable layers including:

**Core Components:**
- Replica - Choice of audio/visual appearance 
- Context - Customizable contextual information, for use by LLM 
- System Prompt - Customizable system prompt, for use by LLM
- Layers
  - Perception - Multimodal vision and understanding settings (Raven)
  - STT - Transcription and turn taking settings (Sparrow)
  - LLM - Language model settings 
  - TTS - Text-to-Speech settings
  {/*- STS - Speech-to-Speech settings*/}

When creating a conversation, the persona configuration determines how the replica interacts, processes information, and responds to participants. Each layer can be fine-tuned to achieve the desired conversational experience.

<Warning>
When using **full pipeline mode**, the `system_prompt` field is required.
</Warning>


Create Persona

This endpoint returns a single persona by its unique identifier.


Get Persona

This endpoint returns a list of all Personas created by the account associated with the API Key in use.


List Personas

This endpoint updates a persona using a JSON Patch payload (RFC 6902). You can modify **any field within the persona** using supported operations like `add`, `remove`, `replace`, `copy`, `move`, and `test`.

For example:

<Note>
Ensure each `path` matches the current persona schema.
</Note>

```json
[
  { "op": "replace", "path": "/persona_name", "value": "Wellness Advisor" },
  { "op": "replace", "path": "/default_replica_id", "value": "r79e1c033f" },
  { "op": "replace", "path": "/context", "value": "Here are a few times that you have helped an individual make a breakthrough in..." },
  { "op": "replace", "path": "/layers/llm/model", "value": "tavus-gpt-4o" },
  { "op": "replace", "path": "/layers/tts/tts_engine", "value": "cartesia" },
  { "op": "add", "path": "/layers/tts/tts_emotion_control", "value": "true" },
  { "op": "remove", "path": "/layers/stt/hotwords" },
  { "op": "replace", "path": "/layers/perception/perception_tool_prompt", "value": "Use tools when identity documents are clearly shown." }
]
```
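
To see what a patch like this does before sending it, it can help to apply it to a local copy of the persona. The following is a deliberately simplified sketch of RFC 6902 semantics, not the server's implementation: `apply_patch` is a hypothetical helper that handles only `add`, `replace`, and `remove` on nested objects (no arrays, no `~0`/`~1` escapes), and the sample persona is made up.

```python
import copy
from typing import Any


def apply_patch(document: dict, operations: list) -> dict:
    """Apply `add`, `replace`, and `remove` operations to nested dicts."""
    result = copy.deepcopy(document)  # leave the caller's document untouched
    for op in operations:
        *parents, leaf = op["path"].lstrip("/").split("/")
        target: Any = result
        for key in parents:
            target = target[key]  # KeyError if the path does not match the schema
        if op["op"] == "add":
            target[leaf] = op["value"]
        elif op["op"] == "replace":
            if leaf not in target:
                raise KeyError(f"cannot replace missing path {op['path']}")
            target[leaf] = op["value"]
        elif op["op"] == "remove":
            del target[leaf]  # KeyError if the member does not exist
        else:
            raise ValueError(f"unsupported op in this sketch: {op['op']}")
    return result


# Hypothetical persona used only to exercise the helper.
persona = {
    "persona_name": "Coach",
    "layers": {"stt": {"hotwords": "example"}, "tts": {"tts_engine": "placeholder"}},
}
patched = apply_patch(persona, [
    {"op": "replace", "path": "/persona_name", "value": "Wellness Advisor"},
    {"op": "replace", "path": "/layers/tts/tts_engine", "value": "cartesia"},
    {"op": "remove", "path": "/layers/stt/hotwords"},
])
```

A `replace` on a path that does not exist raises here, which mirrors why each `path` has to match the current persona schema.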


Patch Persona

This endpoint deletes a single persona by its unique identifier.


Delete Persona

This endpoint creates a new Replica that can be used in a conversation.

By default, all new replicas will be trained using the `phoenix-3` model. You can optionally create phoenix-2 replicas by setting the `model_name` parameter to `phoenix-2`.

The only required body parameter is `train_video_url`. This URL must be a download link, such as a presigned S3 URL. Please ensure you pass in a video that meets the [requirements](/sections/troubleshooting/training-video-size) for training.

Replica training will fail without the following consent statement being present at the beginning of the video:
> I, [FULL NAME], am currently speaking and consent Tavus to create an AI clone of me by using the audio and video samples I provide. I understand that this AI clone can be used to create videos that look and sound like me.

Learn more about the consent statement [here](/sections/troubleshooting/consent-statement).

Learn more about training a personal Replica [here](/sections/replicas/personal-replicas).


Create Replica

This endpoint returns a single Replica by its unique identifier. 

Included in the response body is a `training_progress` string that represents the progress of the Replica training. If there are any errors during training, the `status` will be `error` and the `error_message` will be populated.
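
As a rough illustration, a client might reduce the response body to a single status line. `summarize_training` is a hypothetical helper; it relies only on the `status`, `training_progress`, and `error_message` fields described above and makes no assumption about the format of `training_progress` or about status values other than `error`.

```python
def summarize_training(replica: dict) -> str:
    """Turn a Get Replica response body into a one-line status message."""
    if replica.get("status") == "error":
        # error_message is populated when training fails.
        message = replica.get("error_message") or "no error message provided"
        return f"training failed: {message}"
    # training_progress is a string; it is passed through unparsed.
    return f"status={replica.get('status')}, progress={replica.get('training_progress')}"
```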


Get Replica

This endpoint returns a list of all Replicas created by the account associated with the API Key in use. In the response, a root level `data` key will contain the list of Replicas.


List Replicas

This endpoint deletes a Replica by its unique ID. Deleted Replicas cannot be used in a conversation.


Delete Replica

This endpoint renames a single Replica by its unique identifier.


Rename Replica

This endpoint generates a new video using a Replica and either a script or an audio file. 

The only required body parameters are `replica_id` and either `script` or `audio_url`.

The `replica_id` is a unique identifier for the Replica that will be used to generate the video. The `script` is the text that will be spoken by the Replica in the video. If you would like to generate a video using an audio file instead of a script, you can provide `audio_url` instead of `script`. Currently, `.wav` and `.mp3` files are supported for audio file input.

If a `background_url` is provided, Tavus will record a video of the website and use it as the background for the video. If a `background_source_url` is provided, where the URL points to a download link such as a presigned S3 URL, Tavus will use the video as the background for the video. If neither is provided, the video will consist of a full-screen Replica.
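
The parameter rules above can be checked client-side before a request is sent. This is a minimal sketch under stated assumptions: `build_video_request` is a hypothetical helper, it assumes exactly one of `script` or `audio_url` should be supplied, and it assumes the audio file type is visible in the URL path.

```python
from typing import Optional
from urllib.parse import urlparse

SUPPORTED_AUDIO_EXTENSIONS = (".wav", ".mp3")


def build_video_request(
    replica_id: str,
    script: Optional[str] = None,
    audio_url: Optional[str] = None,
    background_url: Optional[str] = None,
    background_source_url: Optional[str] = None,
) -> dict:
    """Build a request body for video generation, validating the inputs."""
    if not replica_id:
        raise ValueError("replica_id is required")
    if bool(script) == bool(audio_url):
        # Assumption: the two inputs are mutually exclusive.
        raise ValueError("provide exactly one of script or audio_url")
    body = {"replica_id": replica_id}
    if script:
        body["script"] = script
    else:
        # Presigned URLs carry a query string, so inspect only the path.
        path = urlparse(audio_url).path.lower()
        if not path.endswith(SUPPORTED_AUDIO_EXTENSIONS):
            raise ValueError("audio_url must point to a .wav or .mp3 file")
        body["audio_url"] = audio_url
    # Background fields are optional; omitting both yields a full-screen Replica.
    if background_url:
        body["background_url"] = background_url
    if background_source_url:
        body["background_source_url"] = background_source_url
    return body
```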

To learn more about generating videos with Replicas, see [here](/sections/video-generation/overview).

To learn more about writing an effective script for your video, see [Scripting prompting](/sections/video-generation/scripting-prompting).


Generate Video

This endpoint returns a single video by its unique identifier. 

The response body will contain a `status` string that represents the status of the video. If the video is ready, the response body will also contain a `download_url`, `stream_url`, and `hosted_url` that can be used to download, stream, and view the video respectively.


Get Video

This endpoint returns a list of all Videos created by the account associated with the API Key in use.


List Videos

This endpoint deletes a single video by its unique identifier.


Delete Video

This endpoint renames a single video by its unique identifier.


Rename Video

Create a new lipsync video by providing a video URL and an audio URL. The service will synchronize the speaker's mouth movements with the provided audio.


Create Lipsync

This endpoint returns a single lipsync by its unique identifier.


Get Lipsync

This endpoint returns a list of all Lipsyncs created by the account associated with the API Key in use.


List Lipsyncs

This endpoint deletes a single lipsync by its unique identifier.


Delete Lipsync

This endpoint generates an audio file based on a script with a provided Replica.


Generate Speech

This endpoint returns a single speech by its unique identifier.


Get Speech

This endpoint returns a list of all Speeches created by the account associated with the API Key in use.


List Speeches

This endpoint deletes a single speech by its unique identifier.


Delete Speech

This endpoint renames a single speech by its unique identifier.


Rename Speech

Interact with the replica during live conversations.

This is an event developers may broadcast to Tavus.

By broadcasting this event, you are able to tell the replica exactly what to say. Anything that is passed in the `text` field will be spoken by the replica.

This is commonly used in combination with the [Interrupt Interaction](/sections/event-schemas/conversation-interrupt).
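
The interrupt-then-echo combination might be assembled as follows. Only the `text` field comes from the description above; the envelope keys (`message_type`, `event_type`, `conversation_id`, `properties`) and the event type strings `conversation.interrupt` and `conversation.echo` are assumptions inferred from the event schema page names, so check them against the event schemas before relying on this sketch. How the events are broadcast to the call is out of scope here.

```python
from typing import Optional


def make_event(event_type: str, conversation_id: str, properties: Optional[dict] = None) -> dict:
    """Build an interaction event. The envelope keys are assumed, not confirmed."""
    event = {
        "message_type": "conversation",
        "event_type": event_type,
        "conversation_id": conversation_id,
    }
    if properties is not None:
        event["properties"] = properties
    return event


def interrupt_event(conversation_id: str) -> dict:
    return make_event("conversation.interrupt", conversation_id)


def echo_event(conversation_id: str, text: str) -> dict:
    if not text.strip():
        raise ValueError("text must not be empty")
    # Anything passed in `text` is spoken by the replica.
    return make_event("conversation.echo", conversation_id, {"text": text})


def say_now(conversation_id: str, text: str) -> list:
    """Stop the current speech first, then have the replica say `text`."""
    return [interrupt_event(conversation_id), echo_event(conversation_id, text)]
```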


Echo Interaction

This is an event developers may broadcast to Tavus.

By broadcasting this event, you are able to send text that the replica will respond to. The text you provide in the event will be treated as the user transcript and will be responded to as if the user had uttered those phrases during the conversation.


Text Respond Interaction

This is an event developers may broadcast to Tavus.

By broadcasting this event, you are able to update the VAD (Voice Activity Detection) sensitivity of the replica in two dimensions:
- `participant_pause_sensitivity`
- `participant_interrupt_sensitivity`

The supported values are `superlow`, `verylow`, `low`, `medium`, and `high`.

Learn more about the `sensitivity`: [Get Started with Your Own STT](/sections/conversational-video-interface/custom-stt-onboarding)
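
A small validation step can catch unsupported values before the event is broadcast. `sensitivity_properties` is a hypothetical helper; the two field names and the five levels come from the description above, while the requirement that at least one dimension be set is an assumption of this sketch.

```python
from typing import Optional

SENSITIVITY_LEVELS = ("superlow", "verylow", "low", "medium", "high")


def sensitivity_properties(pause: Optional[str] = None, interrupt: Optional[str] = None) -> dict:
    """Return the sensitivity fields to send, rejecting unsupported levels."""
    requested = {
        "participant_pause_sensitivity": pause,
        "participant_interrupt_sensitivity": interrupt,
    }
    properties = {}
    for field, level in requested.items():
        if level is None:
            continue  # leave this dimension unchanged
        if level not in SENSITIVITY_LEVELS:
            raise ValueError(f"{field} must be one of {SENSITIVITY_LEVELS}, got {level!r}")
        properties[field] = level
    if not properties:
        # Assumption: an event that changes nothing is treated as a mistake.
        raise ValueError("set at least one sensitivity dimension")
    return properties
```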


Sensitivity Interaction

This is an event developers may broadcast to Tavus.

By broadcasting this event, you are able to externally send interruptions for the replica to stop talking. This is commonly used in combination with [Text Echo Interactions](/sections/event-schemas/conversation-echo).


Interrupt Interaction

This is an event broadcasted by Tavus.

A replica interrupted event is broadcasted by Tavus when the user interrupts the replica while it is speaking.


Replica Interrupted Event

This is an event developers may broadcast to Tavus.

By broadcasting this event, you are able to overwrite the `conversational_context` that the replica uses to generate responses. 

If `conversational_context` was not provided during conversation creation, the replica will start using the `context` you provide in this event as `conversational_context`.

Learn more about the `conversational_context`: [Create Conversation](/api-reference/conversations/create-conversation)


Overwrite Conversational Context Interaction

This is an event broadcasted by Tavus.

An `utterance event` is broadcasted by Tavus at specific times: the user’s utterance is sent when the replica begins speaking, and a separate event for the replica’s utterance is also sent as the replica starts to speak. Each event contains the content of the respective utterance as well as an indication of who spoke it.

An `utterance` includes all of the words spoken by the user or replica, measured from when the speaker started speaking to when they finished speaking. This could include multiple sentences or phrases.

Utterance events can be used to keep track of what the user or the replica has said.
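
One way to keep track of what was said is to append each utterance to a running transcript. In this sketch the property names `role` and `speech` are assumptions standing in for "who spoke" and "the content of the utterance"; confirm the actual field names in the event schema.

```python
def record_utterance(transcript: list, event: dict) -> None:
    """Append one utterance event to the transcript as a (speaker, text) pair."""
    properties = event.get("properties") or {}
    speaker = properties.get("role", "unknown")  # assumed field name
    text = properties.get("speech", "")          # assumed field name
    if text:
        transcript.append((speaker, text))


def format_transcript(transcript: list) -> str:
    """Render the transcript with one line per utterance."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in transcript)
```

Because an utterance can span several sentences, each entry here is a whole turn rather than a single sentence.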


Utterance Event

This is an event broadcasted by Tavus.

A `tool_call` event is broadcasted by Tavus when an LLM tool call should be made. The event will contain the name and arguments of the function that should be called.

Tool call events can be used to make calls to external APIs or databases.
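
A client typically routes each tool call to a local function by name. The sketch below assumes the event carries the function name and arguments under `properties.name` and `properties.arguments`, and that the arguments may arrive either as a JSON string or as an object; both are assumptions. `get_weather` is a made-up tool used only for illustration.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so tool calls can be dispatched to it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> str:
    # Placeholder body; a real tool would call an external API or database.
    return f"weather lookup requested for {city}"


def handle_tool_call(event: dict) -> Any:
    """Look up the named tool and call it with the event's arguments."""
    properties = event["properties"]
    arguments = properties.get("arguments") or {}
    if isinstance(arguments, str):
        arguments = json.loads(arguments)  # arguments sent as a JSON string
    handler = TOOLS.get(properties["name"])
    if handler is None:
        raise KeyError(f"no tool registered for {properties['name']!r}")
    return handler(**arguments)
```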


Tool Call Event

This is an event broadcasted by Tavus.

The `conversation.replica.started_speaking` and `conversation.replica.stopped_speaking` events are broadcasted by Tavus at specific times:

- `conversation.replica.started_speaking` means the replica has just started speaking.
- `conversation.replica.stopped_speaking` means the replica has just stopped speaking.

When the `replica.stopped_speaking` event is sent, a `duration` field will be included in the event's `properties` object, indicating how long the replica was speaking for in seconds. This value may also be null.

These events are intended to act as triggers for actions within your application. For instance, you may want to start a video or show a slide at times related to when the replica started or stopped speaking.

The `inference_id` can be used to correlate other events and tie things like `conversation.utterance` or `tool_call` together.
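
As an example of using these events as triggers, the sketch below tracks whether the replica is speaking and totals its speaking time. The `duration` field inside `properties` and its possible null value come from the description above; reading the event name from an `event_type` key is an assumption, and `SpeakingClock` is a hypothetical class.

```python
class SpeakingClock:
    """Track replica speaking state and accumulated speaking time."""

    def __init__(self) -> None:
        self.speaking = False
        self.total_seconds = 0.0
        self.unknown_turns = 0  # stopped_speaking events with a null duration

    def handle(self, event: dict) -> None:
        event_type = event.get("event_type")  # assumed key name
        if event_type == "conversation.replica.started_speaking":
            self.speaking = True
        elif event_type == "conversation.replica.stopped_speaking":
            self.speaking = False
            duration = (event.get("properties") or {}).get("duration")
            if duration is None:
                # The duration may be null, so count the turn without adding time.
                self.unknown_turns += 1
            else:
                self.total_seconds += float(duration)
```

An application could show a slide when `speaking` flips to `True` and hide it when the matching stop event arrives.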


Replica Started/Stopped Speaking Event

This is an event broadcasted by Tavus.

The `conversation.user.started_speaking` and `conversation.user.stopped_speaking` events are broadcasted by Tavus at specific times:

- `conversation.user.started_speaking` means the user has just started speaking.
- `conversation.user.stopped_speaking` means the user has just stopped speaking.

These events are intended to act as triggers for actions within your application. For instance, you may want to take a user-facing action or start a backend process at times related to when the user started or stopped speaking.

The `inference_id` can be used to correlate other events and tie things like `conversation.utterance` or `tool_call` together.

Keep in mind that with `speculative_inference`, the `inference_id` will frequently change while the user is speaking, so the `user.started_speaking` `inference_id` will not usually match the `conversation.utterance` `inference_id`.

User Started/Stopped Speaking Event

This is an event broadcasted by Tavus.

A `perception_tool_call` event is broadcasted by Tavus when a perception tool is triggered based on visual context. The event will contain the tool name, arguments, and encoded frames that triggered said tool call.

Perception tool calls can be used to trigger automated actions in response to visual cues detected by the Raven perception system.


Perception Tool Call Event

This is an event broadcasted by Tavus.

This is fired after ending a conversation, when the replica has finished summarizing the visual artifacts that were detected throughout the call. This is a feature that is only available when the persona has `raven-0` specified in the [Perception Layer](/sections/conversational-video-interface/raven).


Perception Analysis

Learn how to embed Tavus's Conversational Video Interface (CVI) into your site or app.

Embed Conversational Video Interface

You can set up a custom S3 bucket, enable recordings in rooms, and get notified when recordings are ready to be shared.
