CVI provides an end-to-end pipeline that takes in a user video input and outputs a realtime replica video. This pipeline is hyper-optimized, with layers tightly coupled to achieve the lowest latency in the market. CVI is still highly customizable: individual layers can be customized or disabled, and different pipeline modes are offered to best fit your use case.

By default, we recommend using as much of the CVI end-to-end pipeline as possible to guarantee the lowest latency and provide the best experience for your customers.

Layers

Tavus provides the following customizable layers as part of the CVI pipeline:

  • Video Conferencing / End-to-End WebRTC (powered by Daily)
    • Tavus provides an end-to-end video call solution, allowing you to get started quickly and jump directly into a room, or build a completely custom UI with raw video streams. You can read more on this in Integrating CVI.
  • Vision
    • User input video can be processed using Vision, allowing the replica to see and respond to user expressions and environments. Vision can easily be disabled if not available or required.
  • Speech Recognition with VAD (Interrupts)
    • An optimized ASR system with incredibly fast and intelligent interrupts
  • LLM
    • Tavus provides ultra-low latency optimized LLMs or allows you to bring your own
  • TTS
    • Tavus provides the TTS audio using a low-latency optimized voice model (powered by Cartesia), or allows you to use one of the other supported voice providers
  • Realtime Replica
    • Tavus provides a high-quality streaming replica powered by our proprietary class of models, the Phoenix model

Pipeline Modes

Tavus offers a number of modes that allow disabling and replacing layers as necessary for your use case.

You can configure the pipeline mode in the Create Persona API.

The CVI pipeline offers four primary modes, each serving a unique purpose in layer customization to meet your use case:

  • Full Pipeline Mode (Default and Recommended)
  • Speech to Speech Mode
  • Audio Echo Mode (Audio to Video)
  • Text Echo Mode (Text to Video)
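As a sketch, the pipeline mode could be selected when you create a persona. The field names and example values below (`pipeline_mode`, `persona_name`, `system_prompt`, and the endpoint URL in the comment) are illustrative assumptions, not the authoritative Create Persona schema; consult the Create Persona API reference for the exact request body.

```python
# Hypothetical sketch of a Create Persona request body that selects a
# pipeline mode. All field names here are assumptions for illustration;
# verify them against the Create Persona API reference.

def build_persona_payload(pipeline_mode: str = "full") -> dict:
    """Assemble an example Create Persona request body."""
    return {
        "persona_name": "Support Agent",     # example value
        "pipeline_mode": pipeline_mode,      # which of the four modes to use
        "system_prompt": "You are a helpful support agent.",
    }

payload = build_persona_payload("full")
# The payload would then be POSTed to the Create Persona endpoint, e.g.:
#   requests.post("https://tavusapi.com/v2/personas",
#                 headers={"x-api-key": "<API_KEY>"}, json=payload)
print(payload["pipeline_mode"])
```

The same payload shape, with a different `pipeline_mode` value, would select one of the other modes described below.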

We recommend using the end-to-end pipeline in its entirety, as it provides the lowest latency and the most optimized multimodal experience. We have optimized a number of LLMs (Llama 3.1, OpenAI) within the end-to-end pipeline. With this end-to-end pipeline you can achieve ~900ms utterance-to-utterance latency, the world's fastest. You can load our LLMs with your knowledge base and prompt them to your liking, as well as update the context live to simulate an async RAG application.

Custom LLM / Bring your own logic

Using a custom LLM is a great fit for those who already have an LLM, or who are building business logic that needs to intercept the input transcription and decide on the output. Note that using your own LLM will likely add latency, as the Tavus LLMs are hyper-optimized for low latency.

Note that the ‘Custom LLM’ does not require an actual LLM. Any endpoint that will respond to chat completion requests in the required format can be used. For example, you could set up a server that takes in the completion requests and responds with predetermined responses, with no LLM involved at all.
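To make the no-LLM case concrete, here is a minimal sketch of a handler that answers OpenAI-style chat completion requests with a predetermined response. The response shape follows the widely used chat-completions format; wrap the function in any HTTP framework of your choice and point the persona's custom LLM endpoint at it.

```python
import time
import uuid

def predetermined_completion(request_body: dict) -> dict:
    """Answer a chat-completions-style request with a canned reply;
    no LLM is involved at all."""
    # This is where custom business logic could inspect the incoming
    # transcript (request_body["messages"]) and decide what to say.
    reply = "Thanks for your message! An agent will follow up shortly."
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request_body.get("model", "predetermined"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
    }
```

A real deployment would also need to honor whatever streaming or error-handling behavior the required format specifies; see the custom LLM guide below for the exact contract.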

Learn about how to use a custom LLM

Speech to Speech Mode

The Speech to Speech pipeline mode allows you to bypass ASR, LLM, and TTS by leveraging an external speech-to-speech model. You may use Tavus speech-to-speech model integrations, or you may bring your own.

Note that in this mode, vision capabilities from Tavus are disabled, as there is currently nowhere to send the visual context.

Audio Echo Mode (Audio to Video)

The Audio Echo pipeline mode allows you to bypass all layers in CVI and directly pass in an audio stream that the replica will repeat. In this mode, interrupts are handled within your audio stream: any audio received is spoken by the replica.

We only recommend this if you have pre-generated audio you would like to use, have a voice-to-voice pipeline, or have a very specific voice requirement.

You can still use the Daily room/WebRTC stream to receive your user’s video/audio stream and have them directly receive the replica video output. Alternatively, you can use this mode server-to-server, where your server connects to the Daily/WebRTC room to provide audio and then forwards the video stream to your user (though this will add latency).

Note that in this mode, vision capabilities from Tavus are disabled, as there is nowhere to send the visual context. You can enable vision on your own by intercepting the participant feed.

Text Echo Mode (Text to Video)

The Text Echo pipeline mode allows you to bypass Tavus Vision, ASR, and LLM and stream text directly into the TTS layer. The replica ‘echoes back’ all text you provide, and you manually control interrupts.

We only recommend this if your application has no need for speech recognition (voice) or vision, or if you have a very specific ASR/vision pipeline that you must use. Using your own ASR is most often slower and less optimized than the integrated Tavus pipeline.

Similar to Audio Echo Mode, you can still use the Daily room/WebRTC stream to receive your user’s video/audio stream and have them directly receive the replica video output, or you can use this mode server-to-server, where your server connects to the Daily/WebRTC room to provide text and then forwards the video stream to your user (though this will add latency).

To use the Text Echo pipeline mode, you will need to leverage the Interactions Protocol to send text.
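As a rough sketch, a Text Echo interaction could be built as a structured message and broadcast into the room. The event name and property keys below (`conversation.echo`, `properties.text`, and the `send_app_message` call in the comment) are assumptions for illustration; verify the exact payload against the Interactions Protocol documentation.

```python
# Hypothetical sketch of a Text Echo interaction payload. Field names
# are assumptions; check the Interactions Protocol docs for the real schema.

def build_echo_interaction(conversation_id: str, text: str) -> dict:
    """Build an interaction payload asking the replica to speak `text`."""
    return {
        "message_type": "conversation",
        "event_type": "conversation.echo",   # assumed event name
        "conversation_id": conversation_id,
        "properties": {"text": text},
    }

interaction = build_echo_interaction("<conversation_id>", "Hello from Text Echo mode!")
# With a Daily client, this would then be broadcast into the room, e.g.:
#   call_client.send_app_message(interaction)
```

Each message you send would be spoken by the replica as-is, which is what makes this mode suitable for fully external ASR/LLM pipelines.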