Overview
CVI enables real-time, human-like video interactions through configurable lifelike replicas.
Conversational Video Interface (CVI) is a framework for creating real-time multimodal video interactions with AI. It enables an AI agent to see, hear, and respond naturally, mirroring human conversation.
CVI is the world’s fastest interface of its kind. It allows you to map a human face and conversational ability onto your AI agent. With CVI, you can achieve utterance-to-utterance latency of under one second, backed by SLAs. This is the full round-trip time from a participant saying something to the replica replying.
CVI provides a comprehensive solution, with the option to plug in your existing components as required.
Key Concepts
CVI is built around three core concepts that work together to create real-time, humanlike interactions with an AI agent:
Persona
The Persona defines the agent’s behavior, tone, and knowledge. It also configures the layers of the CVI pipeline.
Replica
The Replica brings the persona to life visually. It renders a photorealistic human-like avatar using the Phoenix-3 model.
Conversation
A Conversation is a real-time video session that connects the persona and replica through a WebRTC connection.
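To see how the three concepts fit together, here is a minimal sketch that creates a conversation pairing a persona with a replica. The endpoint and field names are assumptions based on the Tavus API reference, and the IDs and API key are placeholders.

```python
# Minimal sketch: creating a Conversation that pairs a Persona with a Replica.
# Endpoint and field names are assumptions based on the Tavus API reference;
# the IDs and API key below are placeholders.
import requests

response = requests.post(
    "https://tavusapi.com/v2/conversations",
    headers={"x-api-key": "your-tavus-api-key"},
    json={
        "persona_id": "p1234567",  # defines behavior, tone, and layer configuration
        "replica_id": "r1234567",  # the photorealistic avatar that is rendered
        "conversation_name": "Demo conversation",
    },
)
conversation = response.json()
# The response is expected to include a WebRTC URL used to join the session.
print(conversation.get("conversation_url"))
```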
Key Features
Natural Interaction
CVI uses facial cues, body language, and real-time turn-taking to enable natural, human-like conversations.
Modular pipeline
Customize the Perception, STT, LLM and TTS layers to control identity, behavior, and responses.
Lifelike AI replicas
Choose from 100+ hyper-realistic digital twins or customize your own with human-like voice and expression.
Multilingual support
Hold natural conversations in 30+ languages using the supported TTS engines.
World's lowest latency
Experience real-time interactions with ~600ms response time and smooth turn-taking.
Layers
The Conversational Video Interface (CVI) is built on a modular layer system, where each layer handles a specific part of the interaction. Together, they capture input, process it, and generate a real-time, human-like response.
Here’s how the layers work together:
1. Transport
Handles real-time audio and video streaming using WebRTC (powered by Daily). This layer captures the user’s microphone and camera input and delivers output back to the user.
This layer is always enabled. You can configure input/output for audio (mic) and video (camera).
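Because the transport layer is standard WebRTC powered by Daily, any Daily client can join the session using the conversation’s join URL. The sketch below uses the daily-python SDK; the URL is a placeholder, and your application’s join flow may differ.

```python
# Minimal sketch: joining a CVI session over WebRTC with the daily-python SDK.
# The meeting URL is a placeholder for the conversation_url returned when the
# conversation is created.
from daily import CallClient, Daily

Daily.init()                    # initialize the Daily runtime once per process
client = CallClient()
client.join("https://your-domain.daily.co/your-conversation")  # placeholder URL

# Once joined, the client streams the user's microphone/camera to CVI and
# receives the replica's audio and video in return.
```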
2. Perception
Uses Raven to analyze user expressions, gaze, background, and screen content. This visual context helps the replica understand and respond more naturally.
Click here to learn how to configure the Perception layer.
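The Perception layer is configured on the persona. Below is an illustrative fragment of that configuration; the field names (perception_model, ambient_awareness_queries) are assumptions here, so refer to the Perception layer guide for the authoritative schema.

```python
# Illustrative Perception layer fragment for a persona's "layers" object.
# Field names are assumptions; see the Perception layer guide for the schema.
perception_layer = {
    "perception_model": "raven-0",            # visual understanding model
    "ambient_awareness_queries": [            # cues Raven should watch for
        "Is the user sharing their screen?",
        "Does the user appear confused?",
    ],
}
```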
3. Speech Recognition (STT)
Powered by Sparrow, this layer transcribes user speech in real time with lexical and semantic awareness. It enables smart, natural turn-taking through fast, intelligent interruptions.
Click here to learn how to configure the Speech Recognition (STT) layer.
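The STT layer is likewise configured on the persona, including how aggressively turns are taken. The fragment below sketches the kinds of options involved; the field names and values are assumptions, so check the Speech Recognition guide for the exact options.

```python
# Illustrative STT layer fragment. Field names and values are assumptions;
# see the Speech Recognition (STT) guide for the supported options.
stt_layer = {
    "stt_engine": "tavus-advanced",                 # real-time transcription engine
    "smart_turn_detection": True,                   # semantic end-of-turn detection
    "participant_pause_sensitivity": "medium",      # how long a pause ends a turn
    "participant_interrupt_sensitivity": "medium",  # how easily the user can interrupt
}
```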
4. Large Language Model (LLM)
Processes the user’s transcribed speech and visual input using a low-latency LLM. Tavus provides ultra-low latency optimized LLMs or lets you integrate your own.
Click here to learn how to configure the Large Language Model (LLM) layer.
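The LLM layer can point at a Tavus-hosted low-latency model or at your own model endpoint. The fragment below is illustrative; field names such as base_url and api_key are assumptions, so see the LLM layer guide for the exact schema.

```python
# Illustrative LLM layer fragment. Field names are assumptions; see the
# Large Language Model (LLM) guide for the exact schema.
llm_layer = {
    "model": "tavus-llama",   # Tavus-hosted, latency-optimized model
    # To bring your own OpenAI-compatible model instead (assumed fields):
    # "model": "gpt-4o",
    # "base_url": "https://api.your-llm-provider.com/v1",
    # "api_key": "your-llm-api-key",
}
```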
5. Text-to-Speech (TTS)
Converts the LLM response into speech using a supported TTS engine: Cartesia (default), ElevenLabs, or PlayHT.
Click here to learn how to configure the Text-to-Speech (TTS) layer.
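The TTS layer selects the speech engine and voice. The fragment below is illustrative; the field names (tts_engine, external_voice_id, api_key) are assumptions, so see the TTS guide for the exact schema.

```python
# Illustrative TTS layer fragment. Field names are assumptions; see the
# Text-to-Speech (TTS) guide for the exact schema.
tts_layer = {
    "tts_engine": "cartesia",               # default; "elevenlabs" and "playht" are also supported
    "external_voice_id": "your-voice-id",   # voice from the chosen TTS provider
    # "api_key": "your-tts-api-key",        # needed when using your own TTS account
}
```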
6. Realtime Replica
Delivers a high-quality, synchronized digital human response using Tavus’s real-time avatar engine powered by Phoenix.
Click here to learn more about the Replica layer.
Most layers are configurable via the Persona.
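To show how that configuration comes together, the sketch below creates a persona whose layers object carries per-layer settings like those described above. The endpoint and field names are assumptions based on the persona API reference; treat the values as placeholders.

```python
# Minimal sketch: creating a Persona that configures several CVI layers at once.
# Endpoint and field names are assumptions; values are placeholders.
import requests

persona = {
    "persona_name": "Support Agent",
    "system_prompt": "You are a friendly, concise product support agent.",
    "default_replica_id": "r1234567",              # the Phoenix replica to render
    "layers": {
        "perception": {"perception_model": "raven-0"},
        "stt": {"smart_turn_detection": True},
        "llm": {"model": "tavus-llama"},
        "tts": {"tts_engine": "cartesia"},
    },
}

response = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "your-tavus-api-key"},
    json=persona,
)
print(response.json().get("persona_id"))
```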
Getting Started
You can quickly create a conversation by using the Tavus Platform or following the steps in the Quickstart guide.