
- CVI — Real-time multimodal video: the agent sees, hears, and responds; media runs over WebRTC (powered by Daily).
- Latency — Utterance-to-utterance round-trip is optimized for real-time use (participant speaks → replica replies).
- Three pillars — Persona (behavior, knowledge, and CVI layer pipeline); Replica (visual digital human, Phoenix); Conversation (live session linking persona and replica).
- Pipeline (in order) — Perception (Raven) → Conversational Flow (Sparrow) → Speech recognition (STT) → Large language model (LLM) → Text-to-speech (TTS) → Realtime replica (Phoenix). Raven is visual perception; Sparrow handles turn-taking and interruptibility; Phoenix is the real-time visual replica engine.
- Where to configure — Most layers are set on the Persona.
Key Concepts
CVI is built around three core concepts that work together to create real-time, humanlike interactions with an AI agent:
Persona
The Persona defines the agent’s behavior, tone, and knowledge. It also configures the CVI layer and pipeline.
Replica
The Replica brings the persona to life visually. It renders a photorealistic human-like avatar using Phoenix.
Conversation
A Conversation is a real-time video session that connects the persona and replica through a WebRTC connection.
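The relationship between the three concepts can be sketched as a small data model. This is a minimal illustration only: the class names mirror the concepts above, but every field name and the `start_conversation` helper are hypothetical and do not represent the Tavus API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Behavior, tone, and knowledge (hypothetical fields)."""
    persona_id: str
    system_prompt: str


@dataclass(frozen=True)
class Replica:
    """The visual digital human (hypothetical fields)."""
    replica_id: str
    name: str


@dataclass(frozen=True)
class Conversation:
    """A live session that references one persona and one replica."""
    conversation_id: str
    persona_id: str
    replica_id: str


def start_conversation(persona: Persona, replica: Replica, conversation_id: str) -> Conversation:
    # Hypothetical helper: a conversation stores references to both objects
    # rather than copies, so one persona or replica can serve many sessions.
    return Conversation(conversation_id, persona.persona_id, replica.replica_id)


persona = Persona("p-demo", "You are a friendly onboarding guide.")
replica = Replica("r-demo", "Demo replica")
conversation = start_conversation(persona, replica, "c-demo")
```

Modeling the conversation as a pair of references reflects the description above: the persona and replica are defined independently, and the conversation is what connects them for a single session.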
Key Features
Natural Interaction
CVI uses facial cues, body language, and real-time turn-taking to enable natural, human-like conversations.
Modular pipeline
Customize the Perception, STT, LLM and TTS layers to control identity, behavior, and responses.
Lifelike AI replicas
Choose from more than 100 hyper-realistic stock replicas or customize your own with human-like voice and expression.
Multilingual support
Hold natural conversations in 42+ languages using the supported TTS engines.
World's lowest latency
Experience real-time interactions with low utterance-to-utterance latency and smooth turn-taking.
Layers
The Conversational Video Interface (CVI) is built on a modular layer system, where each layer handles a specific part of the interaction. Together, they capture input, process it, and generate a real-time, human-like response. Here’s how the layers work together:
1. Perception
Uses Raven to analyze user expressions, gaze, background, and screen content. This visual context helps the replica understand and respond more naturally.
Configure the Perception layer
2. Conversational Flow
Controls the natural dynamics of conversation, including turn-taking and interruptibility. Uses Sparrow for intelligent turn detection, enabling the replica to decide when to speak and when to listen.
Configure the Conversational Flow layer
3. Speech Recognition (STT)
This layer transcribes user speech in real time with lexical and semantic awareness.
Configure the Speech Recognition (STT) layer
4. Large Language Model (LLM)
Processes the user’s transcribed speech and visual input using a low-latency LLM. Tavus provides ultra-low latency optimized LLMs or lets you integrate your own.
Configure the Large Language Model (LLM) layer
5. Text-to-Speech (TTS)
Converts the LLM response into speech using one of the supported TTS engines: Cartesia (default) or ElevenLabs.
Configure the Text-to-Speech (TTS) layer
6. Realtime replica (Phoenix)
Delivers a high-quality, synchronized digital human using Tavus’s real-time avatar engine (Phoenix).
Replica overview
Most layers are configurable via the Persona.
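
The layer order described above can be illustrated with a toy pipeline that passes one user turn through each stage in sequence. This is a sketch under stated assumptions, not how CVI is implemented: the layer keys, the `run_turn` function, the `stub` helper, and all context values are hypothetical placeholders chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

# Layer order as listed above. The string keys are illustrative,
# not identifiers taken from the Tavus API.
LAYER_ORDER: List[str] = [
    "perception",           # Raven
    "conversational_flow",  # Sparrow
    "stt",
    "llm",
    "tts",
    "replica",              # Phoenix
]

Context = Dict[str, str]
Handler = Callable[[Context], Context]


def run_turn(handlers: Dict[str, Handler], context: Context) -> Tuple[Context, List[str]]:
    """Pass one user turn through every layer in order and record the order visited."""
    visited: List[str] = []
    for name in LAYER_ORDER:
        # Copy the context so a handler cannot mutate the caller's dict.
        context = handlers[name](dict(context))
        visited.append(name)
    return context, visited


def stub(key: str, value: str) -> Handler:
    """Build a placeholder handler that only annotates the context."""
    def handler(ctx: Context) -> Context:
        ctx[key] = value
        return ctx
    return handler


# Placeholder handlers standing in for the real layers.
HANDLERS: Dict[str, Handler] = {
    "perception": stub("visual_context", "user is smiling"),
    "conversational_flow": stub("turn", "user finished speaking"),
    "stt": stub("transcript", "hello"),
    "llm": stub("reply_text", "hi there"),
    "tts": stub("reply_audio", "<audio>"),
    "replica": stub("reply_video", "<video>"),
}

result, order = run_turn(HANDLERS, {})
```

Because each handler is looked up by layer name, replacing a single entry in `HANDLERS` swaps one stage without touching the others, which loosely mirrors the idea of customizing individual layers on the Persona.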

