CVI provides an end-to-end pipeline that takes in a user's audio and video input and outputs a real-time replica audio/video output. The pipeline is hyper-optimized, with tightly coupled layers that achieve the lowest latency on the market. CVI is also highly customizable: individual layers can be customized or disabled, and several modes are offered to best fit your use case.

By default, we recommend using as much of the CVI end-to-end pipeline as possible to guarantee the lowest latency and provide the best experience for your customers.

Layers

Tavus provides the following customizable layers as part of the CVI pipeline:

Pipeline Modes

Tavus offers a number of modes that come with preconfigured layers as necessary for your use case. You can configure the pipeline mode in the Create Persona API.
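As a sketch of what this configuration might look like, the snippet below builds a Create Persona request body with an explicit pipeline mode. The field names (`pipeline_mode`, `persona_name`, `system_prompt`), the mode values, and the endpoint URL are assumptions for illustration; confirm them against the Create Persona API reference.

```python
import json

def build_persona_payload(mode="full"):
    """Build a Create Persona request body with an explicit pipeline mode.

    All field names here are illustrative assumptions, not confirmed API
    fields -- check the Create Persona API reference for the real schema.
    """
    return {
        "persona_name": "Support Agent",  # hypothetical example name
        "system_prompt": "You are a helpful support agent.",
        "pipeline_mode": mode,            # e.g. "full" for the end-to-end pipeline
    }

payload = build_persona_payload()
print(json.dumps(payload, indent=2))

# The payload would then be POSTed to the Create Persona endpoint, e.g.:
#   requests.post("https://tavusapi.com/v2/personas",
#                 headers={"x-api-key": "<API_KEY>"}, json=payload)
```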

We recommend using the end-to-end pipeline in its entirety by default, as it provides the lowest latency and the most optimized multimodal experience. We support a number of LLMs (Llama 3.1, OpenAI) that we've optimized within the end-to-end pipeline. With the end-to-end pipeline you can achieve ~500 ms of utterance-to-utterance latency, the world's fastest. You can load our LLMs with your knowledge base and prompt them to your liking, as well as update the context live to simulate an async RAG application.
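To illustrate the live context update mentioned above, here is a hedged sketch of an Interactions Protocol message that overwrites the LLM context mid-conversation. The `message_type`/`event_type` strings and the `context` property are assumptions for illustration; verify them against the Interactions Protocol documentation.

```python
def build_context_update(conversation_id, context):
    """Sketch an Interactions Protocol message that overwrites the LLM context.

    The event name and property names below are illustrative assumptions,
    not confirmed protocol fields.
    """
    return {
        "message_type": "conversation",
        "event_type": "conversation.overwrite_llm_context",
        "conversation_id": conversation_id,
        "properties": {"context": context},
    }

# Example: push a fresh fact into the conversation (hypothetical IDs/content).
msg = build_context_update("c123", "Order #42 shipped yesterday.")
```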

Custom LLM / Bring your own logic

Using a custom LLM is a great option for those who already have an LLM or are building business logic that needs to intercept the input transcription and decide on the output. Using your own LLM will likely add latency, as the Tavus LLMs are hyper-optimized for low latency.

Note that the ‘Custom LLM’ mode doesn’t require an actual LLM. Any endpoint that will respond to chat completion requests in the required format can be used. For example, you could set up a server that takes in the completion requests and responds with predetermined responses, with no LLM involved at all.
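As a minimal sketch of that "no LLM at all" setup, the stub below accepts OpenAI-style chat completion requests and replies with predetermined text. The request/response shapes follow the standard chat-completions format; the canned answers and routing logic are illustrative. Note that a production endpoint may also need to support streamed responses, which this sketch omits.

```python
import json
import time
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

# Predetermined replies -- no LLM involved at all.
CANNED = {
    "hours": "We're open 9am-5pm, Monday through Friday.",
    "default": "Thanks for asking! Let me connect you with a specialist.",
}

def build_completion(messages):
    """Pick a canned reply based on the last user message and wrap it
    in a standard chat-completions response body."""
    last = messages[-1]["content"].lower() if messages else ""
    reply = CANNED["hours"] if "hours" in last else CANNED["default"]
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "canned-responder",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
    }

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        resp = json.dumps(build_completion(body.get("messages", []))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp)

# To run the stub server locally:
#   HTTPServer(("", 8000), Handler).serve_forever()
```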

Learn about how to use Custom LLM mode

Speech to Speech Mode

The Speech to Speech pipeline mode allows you to bypass the ASR, LLM, and TTS layers by leveraging an external speech-to-speech model. You can use Tavus's speech-to-speech model integrations or bring your own.

Note that in this mode, Tavus vision capabilities are disabled, as there is currently no layer to pass the visual context to.

Echo Mode

You can specify audio or text input for the replica to speak out. We only recommend this mode if your application has no need for speech recognition (voice) or vision, or if you have a very specific ASR/vision pipeline that you must use. Using your own ASR is most often slower and less optimized than the integrated Tavus pipeline.

You can use text and audio input interchangeably in Echo Mode. There are two possible configurations, based on whether the microphone is enabled in the Transport layer.

Text or Audio (Base64) Echo

By turning off the microphone in the Transport Layer and using the Interactions Protocol, you can achieve Text and Audio (base64) echo behavior.

  • The Text Echo behavior allows you to bypass Tavus vision, ASR, and LLM and send text directly into the TTS layer. The replica speaks all the text you provide, and you can manually control interrupts.

  • The Audio (Base64) Echo behavior allows you to bypass all layers except the Realtime Replica layer. In this configuration, the replica speaks the audio that you provide.

In order to send text or base64-encoded audio, you should use the Interactions Protocol.
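The two echo behaviors above can be sketched as Interactions Protocol payloads. The `message_type`/`event_type` strings and the property names (`modality`, `text`, `audio`) are assumptions for illustration; confirm the exact schema in the Interactions Protocol reference before use.

```python
import base64

def build_text_echo(conversation_id, text):
    """Sketch a text echo message: the replica speaks this text verbatim.
    Field names here are illustrative assumptions, not confirmed protocol fields."""
    return {
        "message_type": "conversation",
        "event_type": "conversation.echo",
        "conversation_id": conversation_id,
        "properties": {"modality": "text", "text": text},
    }

def build_audio_echo(conversation_id, audio_bytes):
    """Sketch an audio echo message: raw audio bytes are base64-encoded
    so they can travel in a JSON payload."""
    return {
        "message_type": "conversation",
        "event_type": "conversation.echo",
        "conversation_id": conversation_id,
        "properties": {
            "modality": "audio",
            "audio": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }

# Hypothetical conversation ID and content, for illustration only.
text_msg = build_text_echo("c123", "Hello! How can I help you today?")
audio_msg = build_audio_echo("c123", b"\x00\x01\x02\x03")
```

These payloads would then be delivered over the active call as app messages (for example, via your WebRTC client's data channel), per the Interactions Protocol.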

Microphone Echo

By keeping the microphone on in the Transport layer, you can bypass all layers in CVI and pass in an audio stream directly, which the replica will repeat. In this mode, interrupts are handled within your audio stream; any audio received is generated and spoken by the replica.

We only recommend this if you have pre-generated audio you would like to use, have a voice-to-voice pipeline, or have a very specific voice requirement.