General
Training Video and Audio File Size Limit
If you see an error about file size, your training video or audio file is larger than the 750 MB limit. Tavus supports training videos and audio files up to 750 MB, a limit that balances quality against processing speed.
Tavus requires the H.264 codec for all uploads.
To reduce file size:
- Compress the file using video compression tools.
- Lower the resolution - 1080p is usually enough.
- Trim any extra content to shorten the video.
- Reduce the frame rate to around 30 fps.
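Before uploading, the limit can be checked locally. The sketch below is only an illustration: the function names are hypothetical, and it assumes "750 MB" means 750 × 1,000,000 bytes, which the text above does not specify.

```python
import os

# Assumption: "750 MB" is read as 750 * 1000 * 1000 bytes. If the platform
# counts in MiB (1024 * 1024 bytes), change BYTES_PER_MB accordingly.
BYTES_PER_MB = 1000 * 1000
MAX_TRAINING_FILE_BYTES = 750 * BYTES_PER_MB


def exceeds_limit(size_bytes: int) -> bool:
    """Return True when a file of this size is over the 750 MB limit."""
    return size_bytes > MAX_TRAINING_FILE_BYTES


def check_training_file(path: str) -> bool:
    """Return True when the file at `path` is within the limit."""
    return not exceeds_limit(os.path.getsize(path))
```

If `check_training_file` returns False, apply one or more of the steps above and check again.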
Conversational Video Interface (CVI)
PAL Responding to Background Noise
If the PAL starts responding to background sounds, such as people talking nearby, it may be due to the absence of noise filtering.
To resolve this, enable voice_isolation in the Conversational Flow layer of your PAL. This filters background noise from the participant’s microphone audio, improving turn detection accuracy and overall conversation quality.
Learn more in the Voice Isolation documentation.
PAL Is Not Joining the Conversation
This is a rare issue caused by an internal server problem. When it happens, our team is automatically notified and works to resolve it as quickly as possible.
You can check the system status at status.tavus.io. We recommend checking periodically for updates if you encounter this error.
Conversational Flow vs STT: Relationship & Migration
Relationship with STT Layer
The Conversational Flow layer is the recommended approach for configuring turn-taking behavior with Sparrow-1. This supersedes the legacy Sparrow-0 configuration available in the STT layer via smart_turn_detection.
Legacy Approach: Configuring turn-taking via the STT layer’s smart_turn_detection parameter is a legacy approach that uses Sparrow-0. For new implementations, use the Conversational Flow layer with Sparrow-1 instead.
With turn_detection_model set to sparrow-1, the Conversational Flow settings override any corresponding settings in the STT layer.
Parameter Mapping: Sparrow-0 to Sparrow-1
Here’s how Sparrow-0 (STT layer) parameters map to Sparrow-1 (Conversational Flow layer):
| Sparrow-0 (STT Layer) | Sparrow-1 (Conversational Flow Layer) | Notes |
|---|---|---|
| participant_pause_sensitivity | turn_taking_patience | Controls how long to wait before responding |
| participant_interrupt_sensitivity | pal_interruptibility | Controls how easily the PAL can be interrupted |
Migration Guide
If you’re currently using Sparrow-0 settings in the STT layer and want to upgrade to Sparrow-1, note that the pause setting is inverted while the interrupt setting keeps its meaning:
- participant_pause_sensitivity: "high" (quick response) → turn_taking_patience: "low" (eager)
- participant_interrupt_sensitivity: "high" (easy to interrupt) → pal_interruptibility: "high" (easy to interrupt)
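The mapping can be sketched as a small conversion helper. This is an illustration under stated assumptions, not Tavus's own migration code: the function name is hypothetical, "medium" is an assumed third value (only "low" and "high" appear in this section), and the interrupt setting is assumed to carry over unchanged because "high" means easy to interrupt on both sides.

```python
# Assumption: the allowed values are "low", "medium", and "high".
INVERTED = {"low": "high", "medium": "medium", "high": "low"}


def migrate_turn_taking(stt_layer: dict) -> dict:
    """Build Conversational Flow settings from legacy STT-layer settings."""
    flow = {"turn_detection_model": "sparrow-1"}

    pause = stt_layer.get("participant_pause_sensitivity")
    if pause is not None:
        # High pause sensitivity meant a quick response, i.e. low patience.
        flow["turn_taking_patience"] = INVERTED[pause]

    interrupt = stt_layer.get("participant_interrupt_sensitivity")
    if interrupt is not None:
        # Assumption: same value on both sides (high = easy to interrupt).
        flow["pal_interruptibility"] = interrupt

    return flow


legacy = {
    "participant_pause_sensitivity": "high",
    "participant_interrupt_sensitivity": "high",
}
migrated = migrate_turn_taking(legacy)
```

Settings that are absent from the STT layer are simply left out of the result, so platform defaults still apply.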
Legacy Voice Settings
The voice_settings parameter allows additional settings specific to the selected TTS engine. These settings vary per engine:
| Parameter | Cartesia (Sonic-1 only) | ElevenLabs |
|---|---|---|
| speed | Range -1.0 to 1.0 (negative = slower, positive = faster) | Range 0.7 to 1.2 (0.7 = slowest, 1.2 = fastest) |
| emotion | Array of "emotion:level" tags (e.g., "positivity:high") | Not available |
| stability | Not available | Range 0.0 to 1.0 (0.0 = variable, 1.0 = stable) |
| similarity_boost | Not available | Range 0.0 to 1.0 (0.0 = creative, 1.0 = original) |
| style | Not available | Range 0.0 to 1.0 (0.0 = neutral, 1.0 = exaggerated) |
| use_speaker_boost | Not available | Boolean (enhances speaker similarity) |
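Because the ranges differ per engine, a local check can catch out-of-range values before a request is sent. The sketch below encodes only the ranges from the table; the engine keys ("cartesia", "elevenlabs") and the function name are hypothetical labels chosen for the example.

```python
# Ranges copied from the table above; the engine keys are hypothetical labels.
SPEED_RANGES = {
    "cartesia": (-1.0, 1.0),
    "elevenlabs": (0.7, 1.2),
}
ELEVENLABS_UNIT_FIELDS = ("stability", "similarity_boost", "style")


def validate_voice_settings(engine: str, settings: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    low, high = SPEED_RANGES[engine]
    speed = settings.get("speed")
    if speed is not None and not low <= speed <= high:
        problems.append(f"speed {speed} is outside {low} to {high}")

    if engine == "cartesia":
        # ElevenLabs-only fields are listed as "Not available" for Cartesia.
        for name in ELEVENLABS_UNIT_FIELDS + ("use_speaker_boost",):
            if name in settings:
                problems.append(f"{name} is not available for Cartesia")
    else:
        if "emotion" in settings:
            problems.append("emotion is not available for ElevenLabs")
        for name in ELEVENLABS_UNIT_FIELDS:
            value = settings.get(name)
            if value is not None and not 0.0 <= value <= 1.0:
                problems.append(f"{name} {value} is outside 0.0 to 1.0")
    return problems
```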
For more information on each voice setting, see:
• Cartesia Speed and Emotion Controls
• ElevenLabs Voice Settings
Migration from Legacy Perception to Raven-1
Raven-1 is now the default perception model. If you’re upgrading from raven-0 or using legacy field names, here’s how to migrate.
Field Name Changes
The following fields have been renamed for clarity. Legacy names are still supported but deprecated:
| Legacy Field Name | New Field Name | Notes |
|---|---|---|
| ambient_awareness_queries | visual_awareness_queries | Visual stream monitoring |
| perception_tool_prompt | visual_tool_prompt | Instructions for visual tools |
| perception_tools | visual_tools | Visual-triggered functions |
| tool_prompt | (removed) | Use visual_tool_prompt or audio_tool_prompt instead |
New Audio Fields (Raven-1 only)
Raven-1 introduces audio perception capabilities with these new fields:
| Field Name | Description |
|---|---|
| audio_awareness_queries | Custom queries monitoring the audio stream |
| audio_tool_prompt | Instructions for audio-triggered tools |
| audio_tools | Functions triggered by audio analysis |
Migration Example
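A minimal sketch of the rename, assuming the perception layer is a flat object keyed by the field names in the tables above. The helper name and the sample query and prompt text are hypothetical.

```python
# Legacy name -> current name, as listed in the Field Name Changes table.
RENAMED_FIELDS = {
    "ambient_awareness_queries": "visual_awareness_queries",
    "perception_tool_prompt": "visual_tool_prompt",
    "perception_tools": "visual_tools",
}


def migrate_perception_layer(legacy: dict) -> dict:
    """Return a copy of a perception layer that uses current field names."""
    if "tool_prompt" in legacy:
        # The table offers two replacements, so the choice is left to the caller.
        raise ValueError(
            "tool_prompt was removed; use visual_tool_prompt or audio_tool_prompt"
        )
    migrated = {RENAMED_FIELDS.get(key, key): value for key, value in legacy.items()}
    migrated["perception_model"] = "raven-1"
    return migrated


# Before: raven-0 with legacy field names (sample values are made up).
before = {
    "perception_model": "raven-0",
    "ambient_awareness_queries": ["Is the participant holding a document?"],
    "perception_tool_prompt": "Call the tool when a document is visible.",
}
# After: raven-1 with current field names.
after = migrate_perception_layer(before)
```

Audio fields such as audio_awareness_queries can then be added to the migrated layer as needed.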
Raven-1 includes all visual capabilities from raven-0, plus new audio perception. You don’t need to change your visual configuration - just update field names and optionally add audio queries.
For tool definitions, new integrations should use the tools registry (/v2/tools + attach) instead of inline visual_tools / audio_tools. See Legacy inline tool calling if you still rely on inline tools.
Legacy inline tool calling
The Tools overview documents the current approach: create standalone tools at /v2/tools, attach them with /v2/pals/{pal_id}/tools, and configure delivery, auth, and face behavior (on_call, on_resolve) per tool.
Inline tools are the older pattern: OpenAI-style function objects embedded directly on the PAL under layers.llm.tools or layers.perception.*. They remain supported at runtime for existing PALs, but are deprecated for new integrations.
Where inline tools live
| Tool type | Legacy inline location | Patch path (JSON Patch) | Event when fired |
|---|---|---|---|
| LLM (speech-triggered) | layers.llm.tools | /layers/llm/tools | conversation.tool_call |
| Vision (Raven sees) | layers.perception.visual_tools | /layers/perception/visual_tools | conversation.perception_tool_call |
| Audio (Raven hears) | layers.perception.audio_tools | /layers/perception/audio_tools | conversation.perception_tool_call |
Legacy perception field names (perception_tools, perception_tool_prompt, ambient_awareness_queries) still map to the current names - see Migration from Legacy Perception to Raven-1 above.
Pair inline perception tools with the matching prompt field:
- Vision: visual_tool_prompt (or legacy perception_tool_prompt)
- Audio: audio_tool_prompt (Raven-1 only)
Set perception_model: "raven-1" (recommended) or raven-0 for vision-only legacy setups.
Inline tool shape
Each entry uses the OpenAI function calling format. Function names must match ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$.
How inline LLM tools run
- The user speaks; the conversational LLM decides to invoke a function.
- Tavus broadcasts conversation.tool_call to your Daily room with a tool_call_id.
- Your frontend executes the logic and returns a result via conversation.tool_result with the matching tool_call_id (unless the tool is configured as fire-and-forget in a registry attachment - inline tools have no on_resolve control).
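The round trip in the steps above can be sketched as follows. Only the event names and tool_call_id come from this page; the envelope layout (event_type, properties, name, arguments, result), the JSON-encoded arguments, and the get_weather handler are all assumptions made for illustration. A real frontend would typically do this in the browser, and Python is used here only to show the matching of IDs.

```python
import json

# Hypothetical local implementations, keyed by tool name.
HANDLERS = {
    "get_weather": lambda args: {"forecast": f"Sunny in {args['city']}"},
}


def build_tool_result(event: dict) -> dict:
    """Run the requested function and echo back the same tool_call_id."""
    props = event["properties"]                 # hypothetical envelope field
    arguments = json.loads(props["arguments"])  # assumes JSON-encoded arguments
    result = HANDLERS[props["name"]](arguments)
    return {
        "event_type": "conversation.tool_result",
        "properties": {
            # The result must carry the ID from the originating tool call.
            "tool_call_id": props["tool_call_id"],
            "result": json.dumps(result),
        },
    }
```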
Example: inline LLM tools on create
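A minimal sketch of such a payload, assuming the create request body nests tools under layers.llm.tools as described above. The get_weather tool and the helper name are hypothetical, and no endpoint is called here; the name check uses the pattern given for function names.

```python
import re

# Pattern for function names, as stated in the inline tool shape rules.
NAME_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")

# Hypothetical tool in OpenAI function calling format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_create_payload(tools: list) -> dict:
    """Place inline tools under layers.llm.tools after checking their names."""
    for tool in tools:
        name = tool["function"]["name"]
        if not NAME_PATTERN.fullmatch(name):
            raise ValueError(f"invalid function name: {name!r}")
    return {"layers": {"llm": {"tools": tools}}}


payload = build_create_payload([get_weather_tool])
```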
Example: patch inline LLM tools
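A sketch of the patch body, using the /layers/llm/tools path from the table above and the standard JSON Patch operation shape. Whether the API expects "replace" or "add" for this path is an assumption, and the helper name is hypothetical.

```python
import json


def patch_inline_llm_tools(tools: list) -> str:
    """Serialize a JSON Patch document that sets layers.llm.tools."""
    operations = [
        # Assumption: "replace" is accepted here. In JSON Patch, "add" is the
        # operation to use when the target member does not exist yet.
        {"op": "replace", "path": "/layers/llm/tools", "value": tools},
    ]
    return json.dumps(operations)


body = patch_inline_llm_tools([])  # an empty list clears the inline tools
```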
How inline perception tools run
Perception tools fire when Raven detects a visual or audio cue matching the tool’s description. They run in parallel with the conversational LLM - the PAL keeps speaking normally. Inline perception tools are effectively fire-and-forget: Tavus does not wait for or speak a tool result on the conversational side.
- Raven matches a cue to an inline tool definition.
- Tavus broadcasts conversation.perception_tool_call (modality: "vision" or "audio").
- Your frontend handles the event. Returning conversation.tool_result is optional and rarely needed unless you also attach a registry tool with different on_resolve behavior.
Example: inline vision tools
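A sketch of an inline vision tool paired with its prompt field, assuming the perception layer sits under layers.perception. The tool name, description, and prompt text are made up for the example.

```python
# Hypothetical perception layer with one inline vision tool.
perception_layer = {
    "perception_model": "raven-1",
    # Prompt field that pairs with visual_tools.
    "visual_tool_prompt": "Call notify_if_id_shown when the participant holds up an ID card.",
    "visual_tools": [
        {
            "type": "function",
            "function": {
                "name": "notify_if_id_shown",
                # Raven matches cues against this description.
                "description": "Fires when an ID card is visible on camera.",
                "parameters": {"type": "object", "properties": {}, "required": []},
            },
        }
    ],
}

payload = {"layers": {"perception": perception_layer}}
```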
Example: patch inline vision tools
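A sketch that selects the patch path by modality, using the two perception paths from the table above. The "replace" operation and the helper name are assumptions.

```python
# Patch paths from the "Where inline tools live" table.
PATCH_PATHS = {
    "vision": "/layers/perception/visual_tools",
    "audio": "/layers/perception/audio_tools",
}


def perception_tools_patch(modality: str, tools: list) -> list:
    """Build a JSON Patch operation list for inline perception tools."""
    # Assumption: "replace" is the operation the API expects for these paths.
    return [{"op": "replace", "path": PATCH_PATHS[modality], "value": tools}]


vision_patch = perception_tools_patch("vision", [])
```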
Use /layers/perception/audio_tools for inline audio tools the same way.
Mixing inline and registry tools
At conversation start, Tavus merges inline definitions with tools attached via /v2/pals/{pal_id}/tools:
- LLM tools: Registry attachments and inline layers.llm.tools are both advertised to the model. If the same name exists in both places, the registry attachment wins.
- Perception tools: Inline visual_tools / audio_tools are concatenated with attached vision/audio registry tools.
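The merge rules above can be modelled in a few lines. This is an illustration of the stated precedence, not Tavus's implementation: it assumes registry and inline tools both expose function.name, and the ordering of the merged lists is an arbitrary choice made for the example.

```python
def tool_name(tool: dict) -> str:
    """Read the function name (assumes OpenAI-style tool objects)."""
    return tool["function"]["name"]


def merge_llm_tools(registry: list, inline: list) -> list:
    """Advertise both sets; on a name clash keep the registry attachment."""
    registry_names = {tool_name(t) for t in registry}
    return registry + [t for t in inline if tool_name(t) not in registry_names]


def merge_perception_tools(registry: list, inline: list) -> list:
    """Concatenate inline tools with attached registry tools (order assumed)."""
    return inline + registry
```

For example, with a registry tool and an inline tool that are both named lookup_order, `merge_llm_tools` returns only the registry version, while `merge_perception_tools` keeps every entry it is given.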
Migrating to the tools registry
Migrating replaces an inline definition, such as an inline LLM tool, with a registry tool that you create and then attach.
For perception tools, set "origin": "vision" or "origin": "audio" on create instead of embedding under layers.perception. Attaching a vision or audio tool automatically bumps perception_model to raven-1 when needed.
See Tool Calling for LLM, Tool Calling for Perception, and Tool Delivery for the current reference.
Face
Personal Face Creation Failed
Face training can fail when training footage does not meet format, quality, or rights requirements.
Check that your video follows the Training from a video and Which training path? guides. You must have the necessary rights and permissions to use the likeness, voice, and footage you submit - see Platform Policies.
If training still fails, review the error message in the PAL Maker or in Face training errors, then submit a new request through the PAL Maker or Create Face API.
Poor Face Quality
If your face’s lip movements are noticeably out of sync, it may be due to issues with the training video format. Even if the video appears clean, AI-generated content or videos that don’t follow the expected structure can affect training quality.
Common causes:
- The video does not follow the required recording format, which includes:
  - 1 minute of talking
  - 1 minute of silence
- Lips do not fully close during the talking segment, which limits the model’s ability to learn realistic lip movements.
To fix this:
- Record a new video following the correct structure (one minute of talking followed by one minute of silence).
- Speak naturally, allowing full lip movement including closures.
- Avoid using AI-generated videos for training.
Video Generation
Poor Video Generation Quality
If your video looks unnatural or has repeated gestures, it may be due to the script length. Videos over 5 minutes can lead to reduced movement variety and a less natural feel.
To improve quality:
- Keep videos short – under 5 minutes is ideal.
- Break long scripts into smaller, focused segments.
- Tighten the script – remove filler and keep pacing steady.
- Use multiple faces for variety in longer content.
- Review and revise – check for repetition and adjust as needed.
If the issue persists after following the troubleshooting guide above, please don’t hesitate to contact our support team for further assistance.

