Conversational Video Interface (CVI) is a framework for creating real-time, multimodal video interactions with AI. It maps a human face and conversational ability onto your AI agent, enabling it to see, hear, and respond naturally, mirroring human conversation. CVI is the world's fastest interface of its kind: utterance-to-utterance latency, the full round-trip time from a participant finishing an utterance to the replica's reply, is backed by SLAs of under one second. CVI is a comprehensive, end-to-end solution, with the option to plug in your existing components as required.

Key Concepts

CVI is built around three core concepts that work together to create real-time, humanlike interactions with an AI agent: the Replica (the lifelike digital human that appears on screen), the Persona (the configuration that defines the agent's identity, behavior, and layers), and the Conversation (a real-time video session between a participant and the replica).

Key Features

Natural Interaction

CVI uses facial cues, body language, and real-time turn-taking to enable natural, human-like conversations.

Modular pipeline

Customize the Perception, STT, LLM and TTS layers to control identity, behavior, and responses.

Lifelike AI replicas

Choose from 100+ hyper-realistic digital twins, or create your own with human-like voice and expression.

Multilingual support

Hold natural conversations in 30+ languages using the supported TTS engines.

World's lowest latency

Experience real-time interactions with ~600ms response time and smooth turn-taking.

Layers

The Conversational Video Interface (CVI) is built on a modular layer system, where each layer handles a specific part of the interaction. Together, they capture input, process it, and generate a real-time, human-like response. Here’s how the layers work together:
Transport

Handles real-time audio and video streaming over WebRTC (powered by Daily). This layer captures the user's microphone and camera input and delivers output back to the user. It is always enabled, and you can configure audio (mic) and video (camera) input and output.

Perception

Uses Raven to analyze user expressions, gaze, background, and screen content. This visual context helps the replica understand and respond more naturally. See the Perception layer guide to learn how to configure it.

Speech Recognition (STT)

Powered by Sparrow, this layer transcribes user speech in real time with lexical and semantic awareness, enabling smart, natural turn-taking through fast, intelligent interruptions. See the Speech Recognition (STT) layer guide to learn how to configure it.

Large Language Model (LLM)

Processes the user's transcribed speech and visual input using a low-latency LLM. Tavus provides ultra-low-latency optimized LLMs, or you can integrate your own. See the Large Language Model (LLM) layer guide to learn how to configure it.

Text-to-Speech (TTS)

Converts the LLM response into speech using one of the supported TTS engines: Cartesia (default) or ElevenLabs. See the Text-to-Speech (TTS) layer guide to learn how to configure it.

Replica

Delivers a high-quality, synchronized digital human response using Tavus's real-time avatar engine, powered by Phoenix. See the Replica layer guide to learn more.
Most layers are configurable via the Persona.
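For illustration, a persona can bundle the layer settings described above into a single configuration. The sketch below is not a definitive schema: it assumes a v2 `/personas` endpoint and hypothetical layer field names and values, so check the API reference for the exact request shape.

```python
import requests

TAVUS_API_KEY = "your-api-key"  # placeholder credential

# Illustrative persona payload: each key under "layers" configures one CVI layer.
# Field names and values here are assumptions based on this overview, not a confirmed schema.
persona = {
    "persona_name": "Support Agent",
    "system_prompt": "You are a friendly product support specialist.",
    "layers": {
        "perception": {"perception_model": "raven-0"},  # visual understanding (Raven)
        "stt": {"stt_engine": "sparrow"},               # speech recognition (Sparrow)
        "llm": {"model": "tavus-llama"},                # or bring your own LLM
        "tts": {"tts_engine": "cartesia"},              # Cartesia (default) or ElevenLabs
    },
}

response = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": TAVUS_API_KEY, "Content-Type": "application/json"},
    json=persona,
)
response.raise_for_status()
print(response.json())  # includes the persona ID to reference when creating conversations
```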

Getting Started

You can quickly create a conversation by using the Developer Portal or following the steps in the Quickstart guide.
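As a rough sketch of the API flow (assuming a v2 `/conversations` endpoint and the `replica_id`/`persona_id` field names shown; the Quickstart guide has the authoritative request shape), creating a conversation returns a URL that a participant can open to join the real-time video session:

```python
import requests

TAVUS_API_KEY = "your-api-key"  # placeholder credential

# Illustrative request: field names are assumptions based on this overview.
response = requests.post(
    "https://tavusapi.com/v2/conversations",
    headers={"x-api-key": TAVUS_API_KEY, "Content-Type": "application/json"},
    json={
        "replica_id": "r_xxxxxxxx",  # hypothetical replica ID
        "persona_id": "p_xxxxxxxx",  # hypothetical persona ID
    },
)
response.raise_for_status()
conversation = response.json()

# The returned conversation URL can be opened in a browser (or embedded in your app)
# to join the live WebRTC session with the replica.
print(conversation.get("conversation_url"))
```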
If you use Cursor, use this pre-built prompt to get started faster:
I