CVI provides an end-to-end pipeline that takes user video input and outputs a real-time replica video. This pipeline is hyper-optimized, with layers tightly coupled to achieve the lowest latency on the market. CVI is also highly customizable: individual layers can be configured or disabled, and several pipeline modes are offered to best fit your use case.

We recommend using as much of the CVI end-to-end pipeline as possible to achieve the lowest latency and provide the best experience for your customers.

Layers

Tavus provides the following customizable layers as part of the CVI pipeline:

  • Video Conferencing / End-to-End WebRTC (powered by Daily)
    • Tavus provides an end-to-end video call solution, allowing you to get started quickly and jump directly into a room, or build a completely custom UI with raw video streams. You can read more on this in Integrating CVI.
  • Vision
    • User video input can be processed using Vision, allowing the replica to see and respond to user expressions and environments. Vision can easily be disabled if it is not available or not required.
  • Speech Recognition with VAD (Interrupts)
    • An optimized ASR system with fast, intelligent interrupt handling.
  • LLM
    • Tavus provides ultra-low-latency optimized LLMs, or allows you to bring your own.
  • TTS
    • Tavus generates TTS audio using a low-latency optimized voice model (powered by Cartesia), or allows you to use one of the other supported voice providers.
  • Realtime Replica
    • Tavus provides a high-quality streaming replica powered by Phoenix, our proprietary class of replica models.
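
Layers are configured when you create a persona. As a minimal sketch, the request below disables the vision layer and keeps the default Cartesia-powered voice. The field names (layers, perception, tts) mirror the Create Persona API but should be treated as illustrative; verify them against the API reference.

```python
import os
import requests

# Sketch: creating a persona with customized CVI layers. Field names are
# illustrative; consult the Create Persona reference for the exact schema.
payload = {
    "persona_name": "Support Agent",
    "system_prompt": "You are a concise, friendly support agent.",
    "layers": {
        # Illustrative: disable vision if the replica doesn't need to see the user.
        "perception": {"perception_model": "off"},
        # Use the default Cartesia-powered voice, or another supported provider.
        "tts": {"tts_engine": "cartesia"},
    },
}

resp = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": os.environ["TAVUS_API_KEY"]},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["persona_id"])
```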

Pipeline Modes

Tavus offers a number of modes that let you disable or replace layers as needed for your use case.

You can configure the pipeline mode in the Create Persona API.

The CVI pipeline offers four primary modes, each customizing the layers differently to meet your use case:

  • Full Pipeline Mode (Default and Recommended)
  • Speech to Speech Mode
  • Audio Echo Mode (Audio to Video)
  • Echo Mode (Text to Video)
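
Selecting a mode is a single field on the persona. A minimal sketch, assuming "full" (the default) and "echo" as accepted pipeline_mode values; verify the exact values in the Create Persona reference:

```python
import os
import requests

# Minimal sketch: pipeline_mode selects which layers run. "full" is the
# default; other accepted values are assumptions here, so check them
# against the Create Persona reference before relying on them.
resp = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": os.environ["TAVUS_API_KEY"]},
    json={
        "persona_name": "Echo Persona",
        "pipeline_mode": "echo",  # e.g. "full" (default) or "echo"
    },
)
print(resp.json())
```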

Full Pipeline Mode (Default and Recommended)

We recommend using the end-to-end pipeline in its entirety, as it provides the lowest latency and the most optimized multimodal experience. We have a number of LLMs (Llama 3.1, OpenAI) optimized within the end-to-end pipeline. With SLAs as fast as under one second, you get access to the world’s fastest utterance-to-utterance latency. You can load our LLMs with your knowledge base and prompt them to your liking, as well as update the context live to simulate an async RAG application.
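
As a sketch of the knowledge-base and live-context idea: durable knowledge lives in the persona's system prompt, while per-call context can be passed when creating a conversation. The conversational_context field below is drawn from the Create Conversation API, but treat the exact schema as an assumption and check the reference.

```python
import os
import requests

# Sketch: supplying per-call context at conversation creation. The persona's
# system prompt holds the durable knowledge base; conversational_context
# carries session-specific facts and, per the interactions protocol, can be
# updated while the call is live (an async-RAG-style pattern).
resp = requests.post(
    "https://tavusapi.com/v2/conversations",
    headers={"x-api-key": os.environ["TAVUS_API_KEY"]},
    json={
        "persona_id": "p_your_persona_id",  # hypothetical placeholders
        "replica_id": "r_your_replica_id",
        "conversational_context": "The user is currently viewing the pricing page.",
    },
)
print(resp.json()["conversation_url"])
```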

Custom LLM / Bring your own logic

Using a custom LLM is a great option for those who already have an LLM, or who are building business logic that needs to intercept the input transcription and decide on the output. Note that using your own LLM will likely add latency, as the Tavus LLMs are hyper-optimized for low latency.

Note that the ‘Custom LLM’ does not need to be an actual LLM. Any endpoint that responds to chat completion requests in the required format can be used. For example, you could set up a server that takes in the completion requests and responds with predetermined responses, with no LLM involved at all.
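
For illustration, here is a minimal stub of such a server in Python (Flask). It speaks just enough of the OpenAI-style chat-completions response format to return a predetermined reply; confirm the exact request/response format Tavus expects, including whether streaming is required, in the custom LLM docs.

```python
# Sketch: a "Custom LLM" endpoint with no LLM behind it. It returns canned
# replies in an OpenAI-style chat-completions response shape; verify the
# exact format (and streaming requirements) against the Tavus docs.
import time
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

CANNED_REPLY = "Thanks for asking! A human agent will follow up shortly."

@app.post("/v1/chat/completions")
def chat_completions():
    body = request.get_json(force=True)
    # The incoming transcription arrives as the latest user message.
    last_user = next(
        (m["content"] for m in reversed(body.get("messages", []))
         if m.get("role") == "user"),
        "",
    )
    print("user said:", last_user)  # route on this to pick a reply

    return jsonify({
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "predetermined"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": CANNED_REPLY},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(port=8000)
```

You would then point the persona’s custom LLM layer at this server (typically a base_url and model in the llm layer of the Create Persona request).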

Learn how to use a custom LLM

Speech to Speech Mode

The Speech to Speech pipeline mode allows you to bypass the ASR, LLM, and TTS layers by leveraging an external speech-to-speech model. You can use one of the Tavus speech-to-speech model integrations, or bring your own.

Note that in this mode, Tavus vision capabilities are disabled, as there is currently nowhere to send the visual context.
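
A rough sketch of what selecting this mode might look like; the pipeline_mode value and the layer name below are purely illustrative assumptions, not confirmed API fields, so consult the Create Persona reference for the supported speech-to-speech configuration.

```python
import os
import requests

# Rough sketch only: the pipeline_mode value and layer config below are
# illustrative assumptions, not confirmed API fields.
resp = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": os.environ["TAVUS_API_KEY"]},
    json={
        "persona_name": "S2S Persona",
        "pipeline_mode": "speech_to_speech",  # hypothetical value
        "layers": {
            # Hypothetical config for an external speech-to-speech model.
            "speech_to_speech": {"api_key": os.environ.get("S2S_API_KEY", "")},
        },
    },
)
print(resp.json())
```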