General

If you see an error about file size, your training video or audio file exceeds the 750 MB limit that Tavus supports. This limit helps maintain a balance between quality and processing speed.
Tavus requires the H.264 codec for all uploads.
To reduce file size:
  • Compress the file using video compression tools.
  • Lower the resolution - 1080p is usually enough.
  • Trim any extra content to shorten the video.
  • Reduce the frame rate to around 30 fps.
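A quick local check before uploading can catch oversized files early. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the 750 MB limit is counted in binary megabytes (1 MB = 1024 * 1024 bytes), which the text above does not specify.

```python
import os

# Assumption: the 750 MB limit uses binary megabytes (1 MB = 1024 * 1024 bytes).
MAX_UPLOAD_BYTES = 750 * 1024 * 1024


def exceeds_upload_limit(size_bytes: int, limit_bytes: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file of size_bytes is over the limit."""
    return size_bytes > limit_bytes


def check_training_file(path: str) -> str:
    """Return a short human-readable verdict for the file at path."""
    size = os.path.getsize(path)
    size_mb = size / (1024 * 1024)
    if exceeds_upload_limit(size):
        return f"{path}: {size_mb:.1f} MB, over the limit; compress, downscale, or trim it"
    return f"{path}: {size_mb:.1f} MB, within the limit"
```

Calling check_training_file on a local recording returns a one-line verdict; if your account counts decimal megabytes instead, adjust MAX_UPLOAD_BYTES accordingly.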

Conversational Video Interface (CVI)

If the PAL starts responding to background sounds, such as people talking nearby, it may be due to the absence of noise filtering.
To resolve this, enable voice_isolation in the Conversational Flow layer of your PAL. This filters background noise from the participant’s microphone audio, improving turn detection accuracy and overall conversation quality.
{
  "layers": {
    "conversational_flow": {
      "voice_isolation": "near"
    }
  }
}
This is a rare issue caused by an internal server problem. When it happens, our team is automatically notified and works to resolve it as quickly as possible.
You can check the system status at status.tavus.io. We recommend checking periodically for updates if you encounter this error.

Relationship with STT Layer

The Conversational Flow layer is the recommended approach for configuring turn-taking behavior with Sparrow-1. This supersedes the legacy Sparrow-0 configuration available in the STT layer via smart_turn_detection.
Legacy Approach: Configuring turn-taking via the STT layer’s smart_turn_detection parameter is a legacy approach that uses Sparrow-0. For new implementations, use the Conversational Flow layer with Sparrow-1 instead.
When you configure the Conversational Flow layer with turn_detection_model set to sparrow-1, these settings override any corresponding settings in the STT layer.

Parameter Mapping: Sparrow-0 to Sparrow-1

Here’s how Sparrow-0 (STT layer) parameters map to Sparrow-1 (Conversational Flow layer):
| Sparrow-0 (STT Layer) | Sparrow-1 (Conversational Flow Layer) | Notes |
| --- | --- | --- |
| participant_pause_sensitivity | turn_taking_patience | Controls how long to wait before responding |
| participant_interrupt_sensitivity | pal_interruptibility | Controls how easily the PAL can be interrupted |
Important: When using Sparrow-1 via the Conversational Flow layer, any conflicting settings in the STT layer (Sparrow-0) will be overridden. For example, if you set participant_pause_sensitivity: "high" in the STT layer but turn_taking_patience: "low" in the Conversational Flow layer with turn_detection_model: "sparrow-1", the Conversational Flow setting (low) will take precedence.

Migration Guide

If you’re currently using Sparrow-0 settings in the STT layer and want to upgrade to Sparrow-1:
Before (Sparrow-0):
{
  "layers": {
    "stt": {
      "participant_pause_sensitivity": "high",
      "participant_interrupt_sensitivity": "low"
    }
  }
}
After (Sparrow-1):
{
  "layers": {
    "conversational_flow": {
      "turn_detection_model": "sparrow-1",
      "turn_taking_patience": "low",
      "pal_interruptibility": "high",
      "voice_isolation": "near"
    }
  }
}
Note the inverted mapping:
  • participant_pause_sensitivity: "high" (quick response) → turn_taking_patience: "low" (eager)
  • participant_interrupt_sensitivity: "low" (hard to interrupt) → pal_interruptibility: "high" (easy to interrupt)
The naming has been updated in Sparrow-1 to be more intuitive from the face’s perspective.
Recommended: Enable voice_isolation by setting it to "near" when migrating to the Conversational Flow layer. This filters background noise from the participant’s microphone audio, improving turn detection accuracy and overall conversation quality. Learn more in the Voice Isolation documentation.
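The migration above can be sketched as a small conversion helper. This is an illustration under assumptions, not an official migration tool: the function name is hypothetical, the level inversion follows the mapping described above, and "medium" is assumed to exist as a level that maps to itself.

```python
import copy

# Inverted level mapping taken from the migration note above.
# Assumption: "medium" exists as a level and maps to itself.
INVERTED_LEVEL = {"low": "high", "medium": "medium", "high": "low"}

# Sparrow-0 (STT layer) name -> Sparrow-1 (Conversational Flow layer) name.
RENAMED = {
    "participant_pause_sensitivity": "turn_taking_patience",
    "participant_interrupt_sensitivity": "pal_interruptibility",
}


def migrate_to_sparrow_1(config: dict) -> dict:
    """Return a copy of config with Sparrow-0 STT settings moved to conversational_flow."""
    result = copy.deepcopy(config)
    layers = result.setdefault("layers", {})
    stt = layers.get("stt", {})
    flow = layers.setdefault("conversational_flow", {})
    flow["turn_detection_model"] = "sparrow-1"
    for old_name, new_name in RENAMED.items():
        if old_name in stt:
            flow[new_name] = INVERTED_LEVEL[stt.pop(old_name)]
    # Recommended in the text above: enable voice isolation when migrating.
    flow.setdefault("voice_isolation", "near")
    if "stt" in layers and not layers["stt"]:
        del layers["stt"]  # drop the STT layer if nothing else was configured on it
    return result
```

Any other keys on the STT layer are left untouched, and an unrecognized level raises a KeyError rather than being silently guessed.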
The voice_settings parameter allows additional settings specific to the selected TTS engine. These settings vary per engine:
| Parameter | Cartesia (Sonic-1 only) | ElevenLabs |
| --- | --- | --- |
| speed | Range -1.0 to 1.0 (negative = slower, positive = faster) | Range 0.7 to 1.2 (0.7 = slowest, 1.2 = fastest) |
| emotion | Array of "emotion:level" tags (e.g., "positivity:high") | Not available |
| stability | Not available | Range 0.0 to 1.0 (0.0 = variable, 1.0 = stable) |
| similarity_boost | Not available | Range 0.0 to 1.0 (0.0 = creative, 1.0 = original) |
| style | Not available | Range 0.0 to 1.0 (0.0 = neutral, 1.0 = exaggerated) |
| use_speaker_boost | Not available | Boolean (enhances speaker similarity) |
For more information on each voice setting, see:
  • Cartesia Speed and Emotion Controls
  • ElevenLabs Voice Settings
"tts": {
  "voice_settings": {
    "speed": 0.5,
    "emotion": ["positivity:high", "curiosity"]
  }
}
This is a legacy approach. The recommended method for controlling emotion, speed, and volume is now outlined in the TTS documentation.
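For PALs that still rely on voice_settings, the per-engine ranges listed above can be checked before sending a request. The following is a minimal sketch, assuming the engine labels "cartesia" and "elevenlabs" (hypothetical keys chosen for this example) and using only the ranges stated in the table.

```python
# Numeric ranges copied from the table above; engine keys are hypothetical labels.
NUMERIC_RANGES = {
    "cartesia": {"speed": (-1.0, 1.0)},
    "elevenlabs": {
        "speed": (0.7, 1.2),
        "stability": (0.0, 1.0),
        "similarity_boost": (0.0, 1.0),
        "style": (0.0, 1.0),
    },
}
# Non-numeric settings and the Python type each is expected to have.
OTHER_SETTINGS = {
    "cartesia": {"emotion": list},
    "elevenlabs": {"use_speaker_boost": bool},
}


def validate_voice_settings(engine: str, settings: dict) -> list:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    ranges = NUMERIC_RANGES[engine]
    others = OTHER_SETTINGS[engine]
    for name, value in settings.items():
        if name in ranges:
            low, high = ranges[name]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                problems.append(f"{name} must be a number between {low} and {high}")
        elif name in others:
            if not isinstance(value, others[name]):
                problems.append(f"{name} must be of type {others[name].__name__}")
        else:
            problems.append(f"{name} is not available for {engine}")
    return problems
```

The check is deliberately loose about emotion tags: it only confirms the value is a list, since the example above includes a tag without a level.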
Raven-1 is now the default perception model. If you’re upgrading from raven-0 or using legacy field names, here’s how to migrate.

Field Name Changes

The following fields have been renamed for clarity. Legacy names are still supported but deprecated:
| Legacy Field Name | New Field Name | Notes |
| --- | --- | --- |
| ambient_awareness_queries | visual_awareness_queries | Visual stream monitoring |
| perception_tool_prompt | visual_tool_prompt | Instructions for visual tools |
| perception_tools | visual_tools | Visual-triggered functions |
| tool_prompt | (removed) | Use visual_tool_prompt or audio_tool_prompt instead |

New Audio Fields (Raven-1 only)

Raven-1 introduces audio perception capabilities with these new fields:
| Field Name | Description |
| --- | --- |
| audio_awareness_queries | Custom queries monitoring the audio stream |
| audio_tool_prompt | Instructions for audio-triggered tools |
| audio_tools | Functions triggered by audio analysis |

Migration Example

Before (raven-0 with legacy field names):
{
  "layers": {
    "perception": {
      "perception_model": "raven-0",
      "ambient_awareness_queries": ["Is the user showing an ID?"],
      "perception_tool_prompt": "Use notify_id when an ID is detected.",
      "perception_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_id",
            "description": "Notify when ID is detected"
          }
        }
      ]
    }
  }
}
After (raven-1 with current field names):
{
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "visual_awareness_queries": ["Is the user showing an ID?"],
      "visual_tool_prompt": "Use notify_id when an ID is detected.",
      "visual_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_id",
            "description": "Notify when ID is detected"
          }
        }
      ],
      "audio_awareness_queries": ["Does the user sound frustrated?"]
    }
  }
}
Raven-1 includes all visual capabilities from raven-0, plus new audio perception. You don’t need to change your visual configuration - just update field names and optionally add audio queries.
For tool definitions, new integrations should use the tools registry (/v2/tools + attach) instead of inline visual_tools / audio_tools. See Legacy inline tool calling if you still rely on inline tools.
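The field renames can also be applied programmatically to an existing perception layer. This is a sketch with a hypothetical helper name; it covers only the three one-to-one renames from the table above and leaves the removed tool_prompt field alone, since choosing between visual_tool_prompt and audio_tool_prompt needs a human decision.

```python
import copy

# Legacy -> current field names, from the rename table above.
FIELD_RENAMES = {
    "ambient_awareness_queries": "visual_awareness_queries",
    "perception_tool_prompt": "visual_tool_prompt",
    "perception_tools": "visual_tools",
}


def migrate_perception_layer(perception: dict) -> dict:
    """Return a copy of a perception layer using raven-1 and current field names."""
    migrated = {}
    for key, value in perception.items():
        new_key = FIELD_RENAMES.get(key, key)
        # If both the legacy and the current name are present, keep the current one.
        if new_key != key and new_key in perception:
            continue
        migrated[new_key] = copy.deepcopy(value)
    migrated["perception_model"] = "raven-1"
    return migrated
```

The input dictionary is not modified, so the original configuration stays available for comparison or rollback.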
The Tools overview documents the current approach: create standalone tools at /v2/tools, attach them with /v2/pals/{pal_id}/tools, and configure delivery, auth, and face behavior (on_call, on_resolve) per tool.
Inline tools are the older pattern: OpenAI-style function objects embedded directly on the PAL under layers.llm.tools or layers.perception.*. They remain supported at runtime for existing PALs, but are deprecated for new integrations.
Inline tools only support app-message delivery to your frontend. They do not support API delivery, outbound auth, on_call / on_resolve, or attaching the same tool definition to multiple PALs. Use the tools registry for those capabilities.

Where inline tools live

| Tool type | Legacy inline location | Patch path (JSON Patch) | Event when fired |
| --- | --- | --- | --- |
| LLM (speech-triggered) | layers.llm.tools | /layers/llm/tools | conversation.tool_call |
| Vision (Raven sees) | layers.perception.visual_tools | /layers/perception/visual_tools | conversation.perception_tool_call |
| Audio (Raven hears) | layers.perception.audio_tools | /layers/perception/audio_tools | conversation.perception_tool_call |
Legacy perception field names (perception_tools, perception_tool_prompt, ambient_awareness_queries) still map to the current names - see Migration from Legacy Perception to Raven-1 above.
Pair inline perception tools with the matching prompt field:
  • Vision: visual_tool_prompt (or legacy perception_tool_prompt)
  • Audio: audio_tool_prompt (Raven-1 only)
Inline vision and audio tools require perception_model: "raven-1" (recommended). Vision-only legacy setups can also use raven-0.

Inline tool shape

Each entry uses OpenAI function calling format:
{
  "type": "function",
  "function": {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {
          "type": "string",
          "description": "City name"
        }
      },
      "required": ["city"]
    }
  }
}
Function names must match ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$.
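Checking names against this pattern before creating or patching a PAL avoids a rejected request. A minimal sketch, with a hypothetical helper name and the pattern copied verbatim from the rule above:

```python
import re

# Pattern copied verbatim from the naming rule above.
TOOL_NAME_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")


def is_valid_tool_name(name: str) -> bool:
    """Return True if name is an acceptable inline tool function name."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return TOOL_NAME_PATTERN.fullmatch(name) is not None
```

In practice the rule means: start with a letter or underscore, use only letters, digits, and underscores after that, and keep the name to 64 characters or fewer.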

How inline LLM tools run

  1. The user speaks; the conversational LLM decides to invoke a function.
  2. Tavus broadcasts conversation.tool_call to your Daily room with a tool_call_id.
  3. Your frontend executes the logic and returns a result via conversation.tool_result with the matching tool_call_id. Inline tools have no on_resolve control; fire-and-forget behavior can only be configured on a registry attachment.
Tavus does not execute tool calls on your backend. Your client must listen for events and respond. See Tool Calling for LLM for the registry flow, which adds API delivery and face behavior controls.
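The request and response pairing in the steps above can be sketched as a small dispatcher. Everything about the payload shape here is an assumption made for illustration: the field names event_type, name, arguments, and result are not confirmed by the text above, and get_current_weather is a placeholder handler. Only the event names and the requirement to echo tool_call_id come from the steps. A real client would run this logic in the frontend that is connected to the Daily room.

```python
import json


# Hypothetical local implementation, keyed by the inline tool's function name.
def get_current_weather(city: str) -> dict:
    return {"city": city, "forecast": "unknown"}  # placeholder, no real lookup


HANDLERS = {"get_current_weather": get_current_weather}


def build_tool_result(event: dict) -> dict:
    """Turn a conversation.tool_call event into a conversation.tool_result reply.

    Assumed event shape (not confirmed by the documentation above):
    {"event_type": ..., "tool_call_id": ..., "name": ..., "arguments": "<JSON string>"}
    """
    if event.get("event_type") != "conversation.tool_call":
        raise ValueError("not a tool_call event")
    handler = HANDLERS[event["name"]]
    arguments = json.loads(event["arguments"])
    return {
        "event_type": "conversation.tool_result",
        "tool_call_id": event["tool_call_id"],  # must match the incoming call
        "result": handler(**arguments),
    }
```

Keeping handlers in a dictionary keyed by function name makes it straightforward to add a tool without touching the dispatch logic.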

Example: inline LLM tools on create

{
  "pal_name": "Weather Assistant",
  "system_prompt": "You help users check the weather.",
  "pipeline_mode": "full",
  "default_face_id": "r90bbd427f71",
  "layers": {
    "llm": {
      "model": "tavus-gpt-oss",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {
                "city": { "type": "string", "description": "City name" }
              },
              "required": ["city"]
            }
          }
        }
      ]
    }
  }
}

Example: patch inline LLM tools

curl --request PATCH \
  --url https://tavusapi.com/v2/pals/{pal_id} \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '[
    {
      "op": "replace",
      "path": "/layers/llm/tools",
      "value": [
        {
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {
                "city": { "type": "string", "description": "City name" }
              },
              "required": ["city"]
            }
          }
        }
      ]
    }
  ]'

How inline perception tools run

Perception tools fire when Raven detects a visual or audio cue matching the tool’s description. They run in parallel with the conversational LLM - the PAL keeps speaking normally. Inline perception tools are effectively fire-and-forget: Tavus does not wait for or speak a tool result on the conversational side.
  1. Raven matches a cue to an inline tool definition.
  2. Tavus broadcasts conversation.perception_tool_call (modality: "vision" or "audio").
  3. Your frontend handles the event. Returning conversation.tool_result is optional and rarely needed unless you also attach a registry tool with different on_resolve behavior.

Example: inline vision tools

{
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "visual_awareness_queries": [
        "Is the user showing an ID card?"
      ],
      "visual_tool_prompt": "You have a tool named `notify_if_id_shown`. Use it when an ID card is clearly visible.",
      "visual_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_if_id_shown",
            "description": "Trigger when a driver's license or passport is clearly visible",
            "parameters": {
              "type": "object",
              "properties": {
                "id_type": {
                  "type": "string",
                  "description": "Best guess on ID type"
                }
              },
              "required": ["id_type"]
            }
          }
        }
      ]
    }
  }
}

Example: patch inline vision tools

curl --request PATCH \
  --url https://tavusapi.com/v2/pals/{pal_id} \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '[
    {
      "op": "replace",
      "path": "/layers/perception/visual_tools",
      "value": [
        {
          "type": "function",
          "function": {
            "name": "detect_glasses",
            "description": "Trigger when the user is wearing glasses",
            "parameters": {
              "type": "object",
              "properties": {
                "glasses_type": { "type": "string" }
              },
              "required": ["glasses_type"]
            }
          }
        }
      ]
    }
  ]'
Use /layers/perception/audio_tools for inline audio tools the same way.
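Since the three inline locations differ only in their patch path, the request body can be generated by one helper. This is a sketch with hypothetical names; the short keys "llm", "vision", and "audio" are labels chosen for this example, and the paths come from the "Where inline tools live" table.

```python
import json

# Patch paths from the "Where inline tools live" table.
PATCH_PATHS = {
    "llm": "/layers/llm/tools",
    "vision": "/layers/perception/visual_tools",
    "audio": "/layers/perception/audio_tools",
}


def build_tools_patch(tool_type: str, tools: list) -> str:
    """Return the JSON Patch body that replaces the inline tools of one type."""
    patch = [{"op": "replace", "path": PATCH_PATHS[tool_type], "value": tools}]
    return json.dumps(patch)
```

The returned string can be passed as the --data argument of the PATCH request; passing an empty list clears the inline tools at that path.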

Mixing inline and registry tools

At conversation start, Tavus merges inline definitions with tools attached via /v2/pals/{pal_id}/tools:
  • LLM tools: Registry attachments and inline layers.llm.tools are both advertised to the model. If the same name exists in both places, the registry attachment wins.
  • Perception tools: Inline visual_tools / audio_tools are concatenated with attached vision/audio registry tools.
List PAL Tools returns only registry attachments - inline tools are not listed there.
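The merge rules can be modeled in a few lines to reason about what a conversation will see. This is only an illustration of the behavior described above, not Tavus's implementation: the function names are hypothetical, the ordering of the merged lists is an assumption, and registry tools are assumed to carry a top-level name as in the /v2/tools create body.

```python
def tool_name(tool: dict) -> str:
    """Read the name from an inline entry (nested) or a registry tool (top-level)."""
    return tool["function"]["name"] if "function" in tool else tool["name"]


def merge_llm_tools(inline: list, registry: list) -> list:
    """Combine LLM tools; on a name clash the registry attachment wins."""
    registry_names = {tool_name(tool) for tool in registry}
    kept_inline = [tool for tool in inline if tool_name(tool) not in registry_names]
    return registry + kept_inline


def merge_perception_tools(inline: list, registry: list) -> list:
    """Combine perception tools by plain concatenation, as described above."""
    return inline + registry
```

Running merge_llm_tools on your own definitions is a quick way to spot an inline tool that is shadowed by a registry attachment with the same name.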

Migrating to the tools registry

Before (inline LLM tool):
"layers": {
  "llm": {
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } },
            "required": ["city"]
          }
        }
      }
    ]
  }
}
After (registry + attach):
# 1. Create once
curl --request POST \
  --url https://tavusapi.com/v2/tools \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '{
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {
      "type": "object",
      "properties": { "city": { "type": "string", "description": "City name" } },
      "required": ["city"]
    },
    "origin": "llm",
    "delivery": { "app_message": true }
  }'

# 2. Attach to PAL (use tool_id from the response)
curl --request POST \
  --url https://tavusapi.com/v2/pals/{pal_id}/tools \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '{ "tool_ids": ["<tool_id>"] }'

# 3. Remove inline copy (optional but recommended)
curl --request PATCH \
  --url https://tavusapi.com/v2/pals/{pal_id} \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '[{ "op": "replace", "path": "/layers/llm/tools", "value": [] }]'
For perception tools, set "origin": "vision" or "origin": "audio" on create instead of embedding under layers.perception. Attaching a vision or audio tool automatically bumps perception_model to raven-1 when needed.
See Tool Calling for LLM, Tool Calling for Perception, and Tool Delivery for the current reference.
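Step 1 of the migration can be automated by converting each inline entry into a create body. The sketch below uses a hypothetical helper name and mirrors only the fields shown in the create example above (name, description, parameters, origin, delivery); any other registry options are out of scope here.

```python
def inline_to_registry_body(inline_tool: dict, origin: str = "llm") -> dict:
    """Build a POST /v2/tools body from an inline function-calling entry."""
    if origin not in ("llm", "vision", "audio"):
        raise ValueError(f"unknown origin: {origin}")
    function = inline_tool["function"]
    body = {
        "name": function["name"],
        "description": function.get("description", ""),
        "origin": origin,
        # Inline tools only support app-message delivery, so keep the same behavior.
        "delivery": {"app_message": True},
    }
    if "parameters" in function:
        body["parameters"] = function["parameters"]
    return body
```

Serialize the result with json.dumps and send it as the body of the create request, then attach the returned tool_id to the PAL as in step 2.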

Face

Face training can fail when training footage does not meet format, quality, or rights requirements.
Check that your video follows the Training from a video and Which training path? guides. You must have the necessary rights and permissions to use the likeness, voice, and footage you submit - see Platform Policies.
If training still fails, review the error message in the PAL Maker or in Face training errors, then submit a new request through the PAL Maker or Create Face API.
If your face’s lip movements are noticeably out of sync, it may be due to issues with the training video format. Even if the video appears clean, AI-generated content or videos that don’t follow the expected structure can affect training quality.
Common causes:
  • The video does not follow the required recording format, which includes:
    • 1 minute of talking
    • 1 minute of silence
  • Lips do not fully close during the talking segment, which limits the model’s ability to learn realistic lip movements.
To improve your face:
  • Record a new video following the correct structure (one minute of talking followed by one minute of silence).
  • Speak naturally, allowing full lip movement including closures.
  • Avoid using AI-generated videos for training.
For more details, see Which training path? and Training from a video.

Video Generation

If your video looks unnatural or has repeated gestures, it may be due to the script length. Videos over 5 minutes can lead to reduced movement variety and a less natural feel.
To improve quality:
  1. Keep videos short – under 5 minutes is ideal.
  2. Break long scripts into smaller, focused segments.
  3. Tighten the script – remove filler and keep pacing steady.
  4. Use multiple faces for variety in longer content.
  5. Review and revise – check for repetition and adjust as needed.
If the issue persists after following the troubleshooting guide above, please don’t hesitate to contact our support team for further assistance.