Troubleshooting

General

Training Video and Audio File Size Limit

If you see an error about file size, your training video or audio file is larger than the 750 MB limit. Tavus supports training videos and audio files up to 750 MB; this limit helps maintain a balance between quality and processing speed.

Tavus requires the H.264 codec for all uploads.

To reduce file size:

Compress the file using video compression tools.
Lower the resolution — 1080p is usually enough.
Trim any extra content to shorten the video.
Reduce the frame rate to around 30 fps.
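
A pre-upload size check can catch oversized files before the error appears. The following is a minimal sketch, not part of the Tavus API: the function name is hypothetical, and it assumes "750 MB" means 750 × 1024 × 1024 bytes, which the limit above does not specify.

```javascript
// Assumption: "750 MB" is read as 750 * 1024 * 1024 bytes. The documented
// limit does not say whether it uses binary or decimal megabytes.
const MAX_UPLOAD_BYTES = 750 * 1024 * 1024;

// Hypothetical helper: returns true when a file of the given size
// (in bytes) is within the upload limit.
function isWithinUploadLimit(sizeInBytes) {
  if (!Number.isFinite(sizeInBytes) || sizeInBytes < 0) {
    throw new TypeError('sizeInBytes must be a non-negative number');
  }
  return sizeInBytes <= MAX_UPLOAD_BYTES;
}
```

In Node.js the size can come from fs.statSync(path).size; in a browser, from the size property of a File object.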

Conversational Video Interface (CVI)

Replica Responding to Background Noise

If the replica starts responding to background sounds, such as people talking nearby, noise filtering may not be enabled. To resolve this, enable noise cancellation using Daily’s updateInputSettings() method. For example:

callFrame.updateInputSettings({
  audio: {
    processor: {
      type: 'noise-cancellation',
    },
  },
});

Learn more in the Daily SDK documentation.
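
If the browser or device cannot apply the processor, the call may fail. The sketch below wraps the same call in a hypothetical helper, enableNoiseCancellation, that reports success or failure. It assumes updateInputSettings() returns a promise that rejects on failure; confirm this against the Daily SDK documentation for your version.

```javascript
// Hypothetical helper: requests Daily's noise-cancellation audio processor
// and returns true on success, false if the request fails.
// Assumption: updateInputSettings() returns a promise that rejects on error.
async function enableNoiseCancellation(callFrame) {
  try {
    await callFrame.updateInputSettings({
      audio: { processor: { type: 'noise-cancellation' } },
    });
    return true;
  } catch (err) {
    console.warn('Could not enable noise cancellation:', err);
    return false;
  }
}
```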

Replica Is Not Joining the Conversation

This is a rare issue caused by an internal server problem. When it happens, our team is automatically notified and works to resolve it as quickly as possible. You can check the system status at status.tavus.io. If you encounter this error, we recommend checking there periodically for updates.

Conversational Flow vs STT: Relationship & Migration

Relationship with STT Layer

The Conversational Flow layer is the recommended approach for configuring turn-taking behavior with Sparrow-1. This supersedes the legacy Sparrow-0 configuration available in the STT layer via smart_turn_detection.

Legacy Approach: Configuring turn-taking via the STT layer’s smart_turn_detection parameter is a legacy approach that uses Sparrow-0. For new implementations, use the Conversational Flow layer with Sparrow-1 instead.

When you configure the Conversational Flow layer with turn_detection_model set to sparrow-1, these settings override any corresponding settings in the STT layer.

Parameter Mapping: Sparrow-0 to Sparrow-1

Here’s how Sparrow-0 (STT layer) parameters map to Sparrow-1 (Conversational Flow layer):

Sparrow-0 (STT Layer)	Sparrow-1 (Conversational Flow Layer)	Notes
`participant_pause_sensitivity`	`turn_taking_patience`	Controls how long to wait before responding
`participant_interrupt_sensitivity`	`replica_interruptibility`	Controls how easily the replica can be interrupted

Important: When using Sparrow-1 via the Conversational Flow layer, any conflicting settings in the STT layer (Sparrow-0) will be overridden. For example, if you set participant_pause_sensitivity: "high" in the STT layer but turn_taking_patience: "low" in the Conversational Flow layer with turn_detection_model: "sparrow-1", the Conversational Flow setting (low) will take precedence.
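
To spot such conflicts before deploying a persona, a small check can list the STT settings that would be overridden. This is an illustrative sketch only: findOverriddenSttSettings is a hypothetical helper, and it flags an STT parameter only when its Conversational Flow counterpart is explicitly set.

```javascript
// Corresponding parameters, taken from the mapping table above:
// [STT layer name, Conversational Flow layer name].
const CORRESPONDING_SETTINGS = [
  ['participant_pause_sensitivity', 'turn_taking_patience'],
  ['participant_interrupt_sensitivity', 'replica_interruptibility'],
];

// Hypothetical helper: returns the names of STT-layer parameters that a
// Sparrow-1 Conversational Flow configuration in `layers` would override.
function findOverriddenSttSettings(layers) {
  const stt = layers.stt ?? {};
  const flow = layers.conversational_flow ?? {};
  if (flow.turn_detection_model !== 'sparrow-1') return [];
  return CORRESPONDING_SETTINGS
    .filter(([sttKey, flowKey]) => stt[sttKey] !== undefined && flow[flowKey] !== undefined)
    .map(([sttKey]) => sttKey);
}
```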

Migration Guide

If you’re currently using Sparrow-0 settings in the STT layer and want to upgrade to Sparrow-1:

Before (Sparrow-0):

{
  "layers": {
    "stt": {
      "participant_pause_sensitivity": "high",
      "participant_interrupt_sensitivity": "low"
    }
  }
}

After (Sparrow-1):

{
  "layers": {
    "conversational_flow": {
      "turn_detection_model": "sparrow-1",
      "turn_taking_patience": "low",
      "replica_interruptibility": "high"
    }
  }
}

Note the inverted mapping:

participant_pause_sensitivity: "high" (quick response) → turn_taking_patience: "low" (eager)
participant_interrupt_sensitivity: "low" (hard to interrupt) → replica_interruptibility: "high" (easy to interrupt)

The naming has been updated in Sparrow-1 to be more intuitive from the replica’s perspective.
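
The inverted mapping can be applied programmatically when converting many personas. The sketch below is one way to do it under stated assumptions: migrateSttToConversationalFlow is a hypothetical helper, and the "medium" level (mapped to itself) is an assumption, since the examples above only show "high" and "low".

```javascript
// Sparrow-0 sensitivity levels map to the opposite Sparrow-1 level.
// Assumption: "medium" stays "medium"; only "high" and "low" are documented above.
const INVERTED_LEVEL = new Map([
  ['low', 'high'],
  ['medium', 'medium'],
  ['high', 'low'],
]);

function invertLevel(level) {
  if (!INVERTED_LEVEL.has(level)) {
    throw new RangeError(`Unknown sensitivity level: ${level}`);
  }
  return INVERTED_LEVEL.get(level);
}

// Hypothetical helper: builds a Conversational Flow layer from legacy
// STT-layer turn-taking settings. Settings absent from `stt` are omitted.
function migrateSttToConversationalFlow(stt) {
  const flow = { turn_detection_model: 'sparrow-1' };
  if (stt.participant_pause_sensitivity !== undefined) {
    flow.turn_taking_patience = invertLevel(stt.participant_pause_sensitivity);
  }
  if (stt.participant_interrupt_sensitivity !== undefined) {
    flow.replica_interruptibility = invertLevel(stt.participant_interrupt_sensitivity);
  }
  return flow;
}
```

Applied to the "Before" configuration above, this produces the same values as the "After" configuration.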

Legacy Voice Settings

The voice_settings parameter allows additional settings specific to the selected TTS engine. These settings vary per engine:

Parameter	Cartesia (Sonic-1 only)	ElevenLabs
`speed`	Range `-1.0` to `1.0` (negative = slower, positive = faster)	Range `0.7` to `1.2` (`0.7` = slowest, `1.2` = fastest)
`emotion`	Array of `"emotion:level"` tags (e.g., `"positivity:high"`)	Not available
`stability`	Not available	Range `0.0` to `1.0` (`0.0` = variable, `1.0` = stable)
`similarity_boost`	Not available	Range `0.0` to `1.0` (`0.0` = creative, `1.0` = original)
`style`	Not available	Range `0.0` to `1.0` (`0.0` = neutral, `1.0` = exaggerated)
`use_speaker_boost`	Not available	Boolean (enhances speaker similarity)

For more information on each voice setting, see:
• Cartesia Speed and Emotion Controls
• ElevenLabs Voice Settings

Example (Cartesia):

"tts": {
  "voice_settings": {
    "speed": 0.5,
    "emotion": ["positivity:high", "curiosity"]
  }
}

This is a legacy approach. The recommended method for controlling emotion, speed, and volume is now outlined in the TTS documentation.
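
For configurations that still use voice_settings, the ranges in the table above can be checked before sending a request. This is a sketch only: validateVoiceSettings and the engine keys "cartesia" and "elevenlabs" are hypothetical labels chosen for the example, and type checks for emotion and use_speaker_boost are left out.

```javascript
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Numeric ranges copied from the table above. A parameter missing from an
// engine's entry is treated as "Not available" for that engine.
const VOICE_SETTING_RANGES = {
  cartesia: { speed: [-1.0, 1.0] },
  elevenlabs: {
    speed: [0.7, 1.2],
    stability: [0.0, 1.0],
    similarity_boost: [0.0, 1.0],
    style: [0.0, 1.0],
  },
};

// Non-numeric parameters each engine accepts (not type-checked in this sketch).
const NON_NUMERIC_SETTINGS = {
  cartesia: ['emotion'],
  elevenlabs: ['use_speaker_boost'],
};

// Hypothetical helper: returns a list of problems, empty when none are found.
function validateVoiceSettings(engine, settings) {
  if (!has(VOICE_SETTING_RANGES, engine)) return [`unknown engine: ${engine}`];
  const ranges = VOICE_SETTING_RANGES[engine];
  const problems = [];
  for (const [name, value] of Object.entries(settings)) {
    if (NON_NUMERIC_SETTINGS[engine].includes(name)) continue;
    if (!has(ranges, name)) {
      problems.push(`${name} is not available for ${engine}`);
      continue;
    }
    const [min, max] = ranges[name];
    if (typeof value !== 'number' || value < min || value > max) {
      problems.push(`${name} must be between ${min} and ${max}`);
    }
  }
  return problems;
}
```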

Migration from Legacy Perception to Raven-1

Raven-1 is now the default perception model. If you’re upgrading from raven-0 or using legacy field names, here’s how to migrate.

Field Name Changes

The following fields have been renamed for clarity. Legacy names are still supported but deprecated:

Legacy Field Name	New Field Name	Notes
`ambient_awareness_queries`	`visual_awareness_queries`	Visual stream monitoring
`perception_tool_prompt`	`visual_tool_prompt`	Instructions for visual tools
`perception_tools`	`visual_tools`	Visual-triggered functions
`tool_prompt`	(removed)	Use `visual_tool_prompt` or `audio_tool_prompt` instead

New Audio Fields (Raven-1 only)

Raven-1 introduces audio perception capabilities with these new fields:

Field Name	Description
`audio_awareness_queries`	Custom queries monitoring the audio stream
`audio_tool_prompt`	Instructions for audio-triggered tools
`audio_tools`	Functions triggered by audio analysis

Migration Example

Before (raven-0 with legacy field names):

{
  "layers": {
    "perception": {
      "perception_model": "raven-0",
      "ambient_awareness_queries": ["Is the user showing an ID?"],
      "perception_tool_prompt": "Use notify_id when an ID is detected.",
      "perception_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_id",
            "description": "Notify when ID is detected"
          }
        }
      ]
    }
  }
}

After (raven-1 with current field names):

{
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "visual_awareness_queries": ["Is the user showing an ID?"],
      "visual_tool_prompt": "Use notify_id when an ID is detected.",
      "visual_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_id",
            "description": "Notify when ID is detected"
          }
        }
      ],
      "audio_awareness_queries": ["Does the user sound frustrated?"]
    }
  }
}

Raven-1 includes all visual capabilities from raven-0, plus new audio perception. You don’t need to change your visual configuration—just update field names and optionally add audio queries.
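
The field renames above are mechanical, so they can be scripted. The sketch below is a hypothetical helper, migratePerceptionLayer, not a Tavus utility. Because tool_prompt has no single replacement, the sketch stops with an error instead of guessing between visual_tool_prompt and audio_tool_prompt.

```javascript
// Legacy-to-current field names, taken from the Field Name Changes table.
const RENAMED_FIELDS = new Map([
  ['ambient_awareness_queries', 'visual_awareness_queries'],
  ['perception_tool_prompt', 'visual_tool_prompt'],
  ['perception_tools', 'visual_tools'],
]);

// Hypothetical helper: returns a shallow copy of a perception layer with
// legacy field names replaced and perception_model set to "raven-1".
function migratePerceptionLayer(perception) {
  const migrated = {};
  for (const [key, value] of Object.entries(perception)) {
    if (key === 'tool_prompt') {
      // Removed field: the replacement depends on what the prompt is for.
      throw new Error('tool_prompt was removed; use visual_tool_prompt or audio_tool_prompt');
    }
    migrated[RENAMED_FIELDS.get(key) ?? key] = value;
  }
  migrated.perception_model = 'raven-1';
  return migrated;
}
```

Audio fields such as audio_awareness_queries are optional and can be added to the returned object afterwards.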

Replica

Personal Replica Creation Failed

This error usually means your training video is missing the required consent statement or the statement wasn’t clearly spoken.

To generate a digital replica using the Phoenix model, your video must include this line at the beginning, spoken clearly:

“I, [FULL NAME], am currently speaking and give consent to Tavus to create an AI clone of me by using the audio and video samples I provide. I understand that this AI clone can be used to create videos that look and sound like me.”

Make sure to replace [FULL NAME] with your actual name. The consent must be easy to hear and can be spoken in any supported language. You can view the list of supported languages here.

If your video didn’t include this, re-record it with the consent statement at the beginning, then submit a new request through the Developer Portal or API.

Poor Replica Quality

If your replica’s lip movements are noticeably out of sync, it may be due to issues with the training video format. Even if the video appears clean, AI-generated content or videos that don’t follow the expected structure can affect training quality.

Common causes:

The video does not follow the required recording format, which includes:
- 1 minute of talking
- 1 minute of silence
Lips do not fully close during the talking segment, which limits the model’s ability to learn realistic lip movements.

To improve your replica:

Record a new video following the correct structure (one minute of talking followed by one minute of silence).
Speak naturally, allowing full lip movement including closures.
Avoid using AI-generated videos for training.

For more details, see the Replica Training Guide.

Video Generation

Poor Video Generation Quality

If your video looks unnatural or has repeated gestures, it may be due to the script length. Videos over 5 minutes can lead to reduced movement variety and a less natural feel.

To improve quality:

Keep videos short – under 5 minutes is ideal.
Break long scripts into smaller, focused segments.
Tighten the script – remove filler and keep pacing steady.
Use multiple replicas for variety in longer content.
Review and revise – check for repetition and adjust as needed.
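
Breaking a long script into segments can be automated. The sketch below is illustrative only: splitScript is a hypothetical helper, and the word budget (maxWords) is a stand-in for duration that you must choose yourself, since this guide does not state how many words fit in 5 minutes. Sentence detection is deliberately naive and will also split on periods inside numbers or abbreviations.

```javascript
// Hypothetical helper: splits a script into segments of at most `maxWords`
// words, breaking only at sentence boundaries. A single sentence longer
// than `maxWords` becomes its own segment rather than being cut.
function splitScript(script, maxWords) {
  // Naive sentence split: text up to and including ., ! or ?.
  const sentences = script.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const segments = [];
  let current = [];
  let currentWords = 0;
  for (const sentence of sentences) {
    const text = sentence.trim();
    if (text === '') continue;
    const words = text.split(/\s+/).length;
    // Start a new segment when this sentence would exceed the budget.
    if (current.length > 0 && currentWords + words > maxWords) {
      segments.push(current.join(' '));
      current = [];
      currentWords = 0;
    }
    current.push(text);
    currentWords += words;
  }
  if (current.length > 0) segments.push(current.join(' '));
  return segments;
}
```

Each returned segment can then be submitted as its own video generation request.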

If the issue persists after following the troubleshooting guide above, please don’t hesitate to contact our support team for further assistance.
