Conversational Video Interface (CVI) is a framework for creating real-time multimodal video interactions with AI. It enables an AI agent to see, hear, and respond naturally, mirroring human conversation.

CVI is the world’s fastest interface of its kind. It lets you map a human face and conversational ability onto your AI agent. CVI supports SLAs of under 1 second for utterance-to-utterance latency, which is the full round-trip time from a participant saying something to the replica replying.
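As a rough illustration of that latency figure, the sketch below checks a single conversational turn against the 1-second target. The `Turn` type and its timestamp fields are hypothetical, and it assumes the round trip is measured from the moment the participant stops speaking to the moment the replica starts replying; this is one reading of "utterance-to-utterance", not a definition taken from CVI.

```python
from dataclasses import dataclass

# "Under 1 second" comes from the text above; everything else here is illustrative.
SLA_SECONDS = 1.0


@dataclass
class Turn:
    """Hypothetical timestamps for one turn, in seconds (e.g. from time.monotonic())."""
    participant_done_at: float  # assumed: when the participant stopped speaking
    replica_reply_at: float     # assumed: when the replica started replying


def utterance_latency(turn: Turn) -> float:
    """Return the round-trip time for one turn, in seconds."""
    return turn.replica_reply_at - turn.participant_done_at


def within_sla(turn: Turn, sla: float = SLA_SECONDS) -> bool:
    """Return True if the turn's latency is strictly under the SLA."""
    return utterance_latency(turn) < sla
```

In practice you would log both timestamps from your own client events and aggregate over many turns rather than judging a single one.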

CVI provides a comprehensive solution, with the option to plug in your existing components as required.

Key Concepts

CVI is built around three core concepts that work together to create real-time, human-like interactions with an AI agent: the Persona, the replica, and the conversation.

Key Features

Natural interaction

CVI uses facial cues, body language, and real-time turn-taking to enable natural, human-like conversations.

Modular pipeline

Customize the Perception, speech-to-text (STT), LLM, and text-to-speech (TTS) layers to control identity, behavior, and responses.

Lifelike AI replicas

Choose from more than 100 hyper-realistic digital twins or customize your own with human-like voice and expression.

Multilingual support

Hold natural conversations in 30+ languages using the supported TTS engines.

World's lowest latency

Experience real-time interactions with a response time of about 600 ms and smooth turn-taking.

Layers

CVI is built on a modular layer system in which each layer handles a specific part of the interaction. Together, the layers capture input, process it, and generate a real-time, human-like response.

Here’s how the layers work together:

Most layers are configurable via the Persona.
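To make the idea of per-layer configuration concrete, here is a minimal sketch of how a Persona definition with layer overrides might be assembled. The layer names mirror the Perception, STT, LLM, and TTS layers mentioned earlier; the `build_persona` helper, the `system_prompt` and `layers` field names, and the example values are all assumptions for illustration, not the documented Persona schema.

```python
# Layer names taken from the text; the exact keys the platform expects are an assumption.
CONFIGURABLE_LAYERS = {"perception", "stt", "llm", "tts"}


def build_persona(system_prompt: str, **layers: dict) -> dict:
    """Assemble a hypothetical Persona definition with optional per-layer settings."""
    unknown = set(layers) - CONFIGURABLE_LAYERS
    if unknown:
        raise ValueError(f"Unknown layer(s): {sorted(unknown)}")
    # Only the layers you pass are included; the rest would keep platform defaults.
    return {"system_prompt": system_prompt, "layers": layers}


# Example: override only the LLM and TTS layers (placeholder values).
persona = build_persona(
    "You are a friendly product guide.",
    llm={"model": "example-llm"},
    tts={"voice": "example-voice"},
)
```

Keeping each layer as its own dictionary makes it easy to swap a single component, such as your existing LLM, without touching the others.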

Getting Started

You can quickly create a conversation by using the Tavus Platform or following the steps in the Quickstart guide.
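If you prefer to script this step, the sketch below builds (but does not send) an HTTP request to create a conversation. The endpoint URL, the `x-api-key` header, and the `replica_id` and `persona_id` fields are assumptions made for illustration; confirm the exact values in the Quickstart guide before relying on them.

```python
import json
import urllib.request

# Assumed endpoint; verify against the Quickstart guide.
CONVERSATIONS_URL = "https://tavusapi.com/v2/conversations"


def build_conversation_request(
    api_key: str, replica_id: str, persona_id: str
) -> urllib.request.Request:
    """Build a POST request that asks the platform to start a conversation."""
    # Assumed payload fields: which replica to render and which Persona drives it.
    payload = {"replica_id": replica_id, "persona_id": persona_id}
    return urllib.request.Request(
        CONVERSATIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed authentication header name
        },
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen` and parse the JSON body of the response.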