The Perception Layer in Tavus enhances an AI agent with real-time visual and audio understanding. By using Raven, the AI agent becomes more context-aware, responsive, and capable of triggering actions based on visual and audio input.

Configuring the Perception Layer

To configure the Perception Layer, define the following parameters within the layers.perception object:

1. perception_model

Specifies the perception model to use.
  • Options:
    • raven-1 (default and recommended): Real-time emotional understanding from user audio, more natural and human-like interactions, plus advanced visual perception.
    • off: Disables the perception layer.
Screen Share Feature: When using Raven, screen share is enabled by default without additional configuration.

Audio Perception

Raven-1 (the default) analyzes user tone and emotion in real time. This context is automatically sent to the LLM alongside each utterance, enabling more natural, empathetic responses. For example:
<user_audio_analysis>The user sounded sarcastic when they said this</user_audio_analysis>
Wow, I love Mondays.
Audio analysis tags are stripped from transcription callbacks.
Audio analysis output is limited to 32 tokens per utterance.
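Where your own code receives the tagged text, the analysis can be separated from what the user actually said. The following is a minimal Python sketch, assuming the tag appears exactly as in the example above; the helper name split_audio_analysis is hypothetical and not part of the Tavus API.

```python
import re
from typing import Optional, Tuple

# Matches a <user_audio_analysis>...</user_audio_analysis> block anywhere in the text.
_ANALYSIS_TAG = re.compile(
    r"<user_audio_analysis>(.*?)</user_audio_analysis>", re.DOTALL
)


def split_audio_analysis(message: str) -> Tuple[Optional[str], str]:
    """Return (analysis, utterance); analysis is None when no tag is present."""
    match = _ANALYSIS_TAG.search(message)
    if match is None:
        return None, message.strip()
    analysis = match.group(1).strip()
    # Drop the tag and keep only the spoken text.
    utterance = _ANALYSIS_TAG.sub("", message).strip()
    return analysis, utterance
```

Keeping the two parts separate lets you log or display the utterance without the analysis, while still using the analysis for routing decisions.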

Perception Analysis Queries

Raven supports three kinds of queries that differ by when they run and how they affect the call:
  • perception_analysis_queries - Evaluated only at the end of the call. They do not change live behavior; they only shape the summary you get in the Perception Analysis event sent to your conversation callback.
  • visual_awareness_queries and audio_awareness_queries - Evaluated throughout the call. Their answers are passed to the LLM as context, so the PAL can react in real time. You receive this ongoing analysis in each user turn via the Utterance event as user_visual_analysis and user_audio_analysis.
Use visual_awareness_queries and audio_awareness_queries when you want the PAL to be aware of or focus on something specific during the conversation. Use perception_analysis_queries when you want your end-of-call summary to address specific points.
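To keep the three query kinds distinct in code, a small builder can assemble the layers.perception object. This is a sketch only: the function build_perception_layer and its argument names are hypothetical, while the resulting field names are the ones described above.

```python
from typing import Sequence


def build_perception_layer(
    visual_awareness: Sequence[str] = (),
    audio_awareness: Sequence[str] = (),
    end_of_call: Sequence[str] = (),
    model: str = "raven-1",
) -> dict:
    """Assemble a layers.perception object, omitting empty query lists."""
    layer = {"perception_model": model}
    if visual_awareness:  # evaluated throughout the call, fed to the LLM
        layer["visual_awareness_queries"] = list(visual_awareness)
    if audio_awareness:  # evaluated throughout the call, fed to the LLM
        layer["audio_awareness_queries"] = list(audio_awareness)
    if end_of_call:  # evaluated once, shapes the end-of-call summary only
        layer["perception_analysis_queries"] = list(end_of_call)
    return layer


perception = build_perception_layer(
    visual_awareness=["Does the user appear distressed or uncomfortable?"],
    audio_awareness=["Does the user sound frustrated or confused?"],
    end_of_call=["Is there any indication that more than one person is present?"],
)
```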

Visual Perception Configuration

2. visual_awareness_queries

An array of custom queries that Raven continuously monitors in the visual stream.
"visual_awareness_queries": [
  "Is the user wearing a bright outfit?"
]
Raven evaluates these queries continuously during the call (on the order of every second). The answers are fed into the rolling visual context for the LLM, so the PAL can respond to what it “sees.” This same context also supports the end-of-call summary. You can read the ongoing visual analysis for each user utterance in the Utterance event as user_visual_analysis.
When to use: when you want the PAL to pay attention to something visual in real time (e.g. expression, clothing, objects on screen).
Example:
"visual_awareness_queries": [
  "What is the main expression on the user's face?",
  "Is the user wearing a jacket?",
  "Does the user appear distressed or uncomfortable?"
]

3. perception_analysis_queries

An array of custom queries that Raven processes at the end of the call to generate a visual analysis summary for the user.
These queries are answered once, at the end of the call, based on what was observed over the whole conversation. They do not affect the call itself, only the content of the end-of-call summary. (Currently the summary is visual only; the naming is kept general for future support.)
When to use: when you want the post-call report to answer specific questions (e.g. “Did the user ever have two people on screen?”, “How often was the user looking at the screen?”).
Example:
"perception_analysis_queries": [
  "On a scale of 1-100, how often was the user looking at the screen?",
  "Is there any indication that more than one person is present?"
]
The answers are delivered in a Perception Analysis event. Example payload:
{
  "properties": {
    "analysis": "**User's Gaze Toward Screen:** The participant looked at the screen approximately 75% of the time.\n\n**Multiple People Present:** No indication of additional participants was detected during the call."
  },
  "conversation_id": "<conversation_id>",
  "event_type": "application.perception_analysis",
  "timestamp": "2025-07-11T09:13:35.361736Z"
}
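A callback handler may want the individual answers rather than one block of text. The sketch below splits the analysis string on its bold headings. It assumes the summary follows the "**Heading:** answer" pattern seen in the example payload, which is not a documented guarantee; the function name parse_perception_analysis is hypothetical.

```python
import re

# Assumed pattern: each answer starts with a bold heading such as "**Topic:** text".
_SECTION = re.compile(r"\*\*(.+?):\*\*\s*(.*?)(?=\n\s*\*\*|\Z)", re.DOTALL)


def parse_perception_analysis(payload: dict) -> dict:
    """Map each bold heading in properties.analysis to its answer text."""
    if payload.get("event_type") != "application.perception_analysis":
        return {}  # not a Perception Analysis event
    analysis = payload.get("properties", {}).get("analysis", "")
    return {
        heading.strip(): body.strip()
        for heading, body in _SECTION.findall(analysis)
    }
```

If the summary does not use bold headings, the function returns an empty dict, so callers should fall back to the raw analysis string.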
You do not need to set visual_awareness_queries in order to use perception_analysis_queries.
"perception_analysis_queries": [
  "Is the user wearing multiple bright colors?",
  "Is there any indication that more than one person is present?",
  "On a scale of 1-100, how often was the user looking at the screen?"
]
Best practices for visual_awareness_queries and perception_analysis_queries:
  • Use simple, focused prompts.
  • Use queries that support your PAL’s purpose.
All Raven API parameters (queries, prompts, tool definitions, etc.) have a 10,000 character limit per entry. Entries exceeding this limit will cause an exception.
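Because an oversized entry causes an exception, it can help to check lengths before sending the configuration. The following is a rough pre-flight sketch; the helper oversized_entries is hypothetical, and measuring tool definitions by their serialized JSON length is an assumption, since the text above does not say how non-string entries are counted.

```python
import json

MAX_ENTRY_CHARS = 10_000  # per-entry limit stated above


def oversized_entries(perception: dict) -> list:
    """Return (field, index, length) for every entry longer than the limit."""
    problems = []
    for field, value in perception.items():
        # Treat a single prompt string as a one-element list of entries.
        entries = value if isinstance(value, list) else [value]
        for index, entry in enumerate(entries):
            # Assumption: objects (tool definitions) are measured as JSON text.
            text = entry if isinstance(entry, str) else json.dumps(entry)
            if len(text) > MAX_ENTRY_CHARS:
                problems.append((field, index, len(text)))
    return problems
```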

4. visual_tool_prompt

Tell Raven when and how to trigger tools based on what it sees.
"visual_tool_prompt":
  "You have a tool to notify the system when a bright outfit is detected, named `notify_if_bright_outfit_shown`. You MUST use this tool when a bright outfit is detected."

5. visual_tools

Legacy inline perception tools. For new integrations, create vision tools at /v2/tools with origin: "vision" and attach them to the PAL - see Tool Calling for Perception. The field below defines OpenAI-style function objects directly on the PAL. Tavus still merges them at conversation start alongside any registry tools you attach, but inline tools cannot use registry-only settings (delivery, API auth, etc.).
"visual_tools": [
  {
    "type": "function",
    "function": {
      "name": "notify_if_bright_outfit_shown",
      "description": "Use this function when a bright outfit is detected in the image with high confidence",
      "parameters": {
        "type": "object",
        "properties": {
          "outfit_color": {
            "type": "string",
            "description": "Best guess on what color of outfit it is"
          }
        },
        "required": ["outfit_color"]
      }
    }
  }
]
Legacy field names (perception_tools, perception_tool_prompt) still work - see Migration from Legacy Perception to Raven-1. For the full legacy inline reference, see Legacy inline tool calling.
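When your application handles a triggered tool, it may be worth confirming that the arguments include everything the definition marks as required. The sketch below checks this against inline definitions shaped like the visual_tools example. How the tool call reaches your application is covered in Tool Calling for Perception; the name and arguments inputs here, and the helper missing_required_args, are assumptions for illustration.

```python
def missing_required_args(tools: list, name: str, arguments: dict) -> list:
    """Return required parameter names absent from `arguments` for tool `name`."""
    for tool in tools:
        function = tool.get("function", {})
        if function.get("name") == name:
            required = function.get("parameters", {}).get("required", [])
            return [key for key in required if key not in arguments]
    # No inline definition with this name was found.
    raise KeyError(f"unknown tool: {name}")
```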

Audio Perception Configuration (Raven-1)

The following fields are available when using raven-1 and enable custom audio-based perception capabilities.

6. audio_awareness_queries

An array of custom queries that Raven-1 continuously monitors in the audio stream. Use these to track specific audio patterns or user states.
Audio analysis output is limited to 32 tokens per query response.
"audio_awareness_queries": [
  "Does the user sound frustrated or confused?",
  "Is the user speaking quickly as if in a hurry?"
]
Raven-1 evaluates these queries continuously during the call on the audio stream. The answers are passed to the LLM as context so the PAL can respond to tone and delivery. You can read the ongoing audio analysis for each user utterance in the Utterance event as user_audio_analysis. (There is no separate end-of-call summary for audio.)
When to use: when you want the PAL to react to how the user sounds (e.g. frustrated, confused, in a hurry).

7. audio_tool_prompt

Tell Raven-1 when and how to trigger tools based on what it hears (beyond the automatic emotion analysis).
"audio_tool_prompt":
  "You have a tool to escalate to a human agent when the user sounds very frustrated, named `escalate_to_human`. Use this tool when detecting sustained frustration."

8. audio_tools

Legacy inline perception tools. For new integrations, create audio tools at /v2/tools with origin: "audio" and attach them to the PAL - see Tool Calling for Perception.
"audio_tools": [
  {
    "type": "function",
    "function": {
      "name": "escalate_to_human",
      "description": "Escalate the conversation to a human agent when user frustration is detected",
      "parameters": {
        "type": "object",
        "properties": {
          "reason": {
            "type": "string",
            "description": "The reason for escalation"
          }
        },
        "required": ["reason"]
      }
    }
  }
]
Requires perception_model: "raven-1". Legacy inline details: Legacy inline tool calling.

Example Configurations

The JSON below uses legacy inline visual_tools / audio_tools for illustration. New PALs should define tools in the tools registry instead.
This example demonstrates a PAL that monitors for visual cues (bright outfits) and triggers a tool when detected.
{
  "pal_name": "Fashion Advisor",
  "system_prompt": "As a Fashion Advisor, you specialize in offering tailored fashion advice.",
  "pipeline_mode": "full",
  "default_face_id": "r90bbd427f71",
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "visual_awareness_queries": [
        "Is the user wearing a bright outfit?"
      ],
      "perception_analysis_queries": [
        "Is the user wearing multiple bright colors?",
        "On a scale of 1-100, how often was the user looking at the screen?"
      ],
      "visual_tool_prompt": "You have a tool to notify the system when a bright outfit is detected, named `notify_if_bright_outfit_shown`. You MUST use this tool when a bright outfit is detected.",
      "visual_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_if_bright_outfit_shown",
            "description": "Use this function when a bright outfit is detected in the image with high confidence",
            "parameters": {
              "type": "object",
              "properties": {
                "outfit_color": {
                  "type": "string",
                  "description": "Best guess on what color of outfit it is"
                }
              },
              "required": ["outfit_color"]
            }
          }
        }
      ]
    }
  }
}
This example demonstrates a PAL that monitors user tone and escalates to a human agent when sustained frustration is detected.
{
  "pal_name": "Support Agent",
  "system_prompt": "You are a helpful customer support agent.",
  "pipeline_mode": "full",
  "default_face_id": "r90bbd427f71",
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "audio_awareness_queries": [
        "Does the user sound frustrated or confused?",
        "Is the user speaking quickly as if in a hurry?"
      ],
      "audio_tool_prompt": "You have a tool to escalate to a human agent when the user sounds very frustrated, named `escalate_to_human`. Use this tool when detecting sustained frustration.",
      "audio_tools": [
        {
          "type": "function",
          "function": {
            "name": "escalate_to_human",
            "description": "Escalate the conversation to a human agent when user frustration is detected",
            "parameters": {
              "type": "object",
              "properties": {
                "reason": {
                  "type": "string",
                  "description": "The reason for escalation"
                }
              },
              "required": ["reason"]
            }
          }
        }
      ]
    }
  }
}
Please see the Create a PAL endpoint for more details.
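In both example configurations, the tool prompt names a tool that must also exist in the matching tools array. A quick consistency check can catch a mismatch before the PAL is created. This sketch assumes prompts quote tool names in backticks, as the examples do, and it only looks at inline tools, not tools attached from the registry; undefined_tools_in_prompt is a hypothetical helper.

```python
import re


def undefined_tools_in_prompt(perception: dict, kind: str) -> set:
    """Names in backticks in `<kind>_tool_prompt` with no match in `<kind>_tools`.

    `kind` is "visual" or "audio".
    """
    prompt = perception.get(f"{kind}_tool_prompt", "")
    # Collect identifiers quoted in backticks, e.g. `escalate_to_human`.
    mentioned = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_]*)`", prompt))
    defined = {
        tool.get("function", {}).get("name")
        for tool in perception.get(f"{kind}_tools", [])
    }
    return mentioned - defined
```

An empty set means every tool named in the prompt has an inline definition.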