The Perception Layer in Tavus enhances an AI agent with real-time visual and audio understanding. By using Raven, the AI agent becomes more context-aware, responsive, and capable of triggering actions based on visual and audio input.

Configuring the Perception Layer

To configure the Perception Layer, define the following parameters within the layers.perception object:

1. perception_model

Specifies the perception model to use.
  • Options:
    • raven-1 (default and recommended): Real-time emotional understanding from user audio, more natural and human-like interactions, plus all visual capabilities from raven-0.
    • raven-0 (legacy): The previous-generation model with visual perception only; see the raven-0 legacy settings for its configuration options.
    • off: Disables the perception layer.
Screen Share Feature: When using Raven, screen share is enabled by default without additional configuration.
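For example, a minimal layers.perception object that simply enables raven-1 (the other fields described below are optional additions) looks like this:
{
  "layers": {
    "perception": {
      "perception_model": "raven-1"
    }
  }
}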

Audio Perception

Raven-1 (the default) analyzes user tone and emotion in real time. This context is automatically sent to the LLM alongside utterances, enabling more natural, empathetic responses. For example:
<user_audio_analysis>The user sounded sarcastic when they said this</user_audio_analysis>
Wow, I love Mondays.
Audio analysis tags are stripped from transcription callbacks.

Perception Analysis Queries

Raven supports three kinds of queries that differ by when they run and how they affect the call:
  • perception_analysis_queries — Evaluated only at end of call. They do not change live behavior; they only shape the summary you get in the Perception Analysis event sent to your conversation callback.
  • visual_awareness_queries and audio_awareness_queries — Evaluated throughout the call. Their answers are passed to the LLM as context, so the replica can react in real time. You receive this ongoing analysis in each user turn via the Utterance event as user_visual_analysis and user_audio_analysis.
Use visual_awareness_queries and audio_awareness_queries when you want the replica to be aware of or focus on something specific during the conversation. Use perception_analysis_queries when you want your end-of-call summary to address specific points.
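For a side-by-side sketch, here is a perception object (reusing queries from the examples later on this page) that combines live visual and audio awareness queries with an end-of-call analysis query:
{
  "perception_model": "raven-1",
  "visual_awareness_queries": [
    "What is the main expression on the user's face?"
  ],
  "audio_awareness_queries": [
    "Does the user sound frustrated or confused?"
  ],
  "perception_analysis_queries": [
    "On a scale of 1-100, how often was the user looking at the screen?"
  ]
}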

Visual Perception Configuration

2. visual_awareness_queries

An array of custom queries that Raven continuously monitors in the visual stream.
"visual_awareness_queries": [
  "Is the user wearing a bright outfit?"
]
Queries that Raven evaluates continuously during the call (on the order of every second). The answers are fed into the rolling visual context for the LLM, so the replica can respond to what it “sees.” This same context also supports the end-of-call summary. You can read the ongoing visual analysis for each user utterance in the Utterance event as user_visual_analysis.
When to use: When you want the replica to pay attention to something visual in real time (e.g. expression, clothing, objects on screen).
Example:
"visual_awareness_queries": [
  "What is the main expression on the user's face?",
  "Is the user wearing a jacket?",
  "Does the user appear distressed or uncomfortable?"
]

3. perception_analysis_queries

An array of custom queries that Raven processes at the end of the call to generate a visual analysis summary, delivered to your conversation callback.
Queries that are answered once, at the end of the call, by looking at what was observed over the whole conversation. They do not affect the call itself; they only shape the content of the end-of-call summary. (Currently the summary is visual only; naming is kept general for future support.)
When to use: When you want the post-call report to answer specific questions (e.g. “Did the user ever have two people on screen?”, “How often was the user looking at the screen?”).
Example:
"perception_analysis_queries": [
  "On a scale of 1-100, how often was the user looking at the screen?",
  "Is there any indication that more than one person is present?"
]
The answers are delivered in a Perception Analysis event. Example payload:
{
  "properties": {
    "analysis": "**User's Gaze Toward Screen:** The participant looked at the screen approximately 75% of the time.\n\n**Multiple People Present:** No indication of additional participants was detected during the call."
  },
  "conversation_id": "<conversation_id>",
  "event_type": "application.perception_analysis",
  "timestamp": "2025-07-11T09:13:35.361736Z"
}
You do not need to set visual_awareness_queries in order to use perception_analysis_queries.
"perception_analysis_queries": [
  "Is the user wearing multiple bright colors?",
  "Is there any indication that more than one person is present?",
  "On a scale of 1-100, how often was the user looking at the screen?"
]
Best practices for visual_awareness_queries and perception_analysis_queries:
  • Use simple, focused prompts.
  • Use queries that support your persona’s purpose.
All Raven API parameters (queries, prompts, tool definitions, etc.) have a 1,000 character limit per entry. Entries exceeding this limit will cause an exception.

4. visual_tool_prompt

A prompt that tells Raven when and how to trigger tools based on what it sees.
"visual_tool_prompt":
  "You have a tool to notify the system when a bright outfit is detected, named `notify_if_bright_outfit_shown`. You MUST use this tool when a bright outfit is detected."

5. visual_tools

Defines callable functions that Raven can trigger upon detecting specific visual conditions. Each tool must include a type and a function object detailing its schema.
"visual_tools": [
  {
    "type": "function",
    "function": {
      "name": "notify_if_bright_outfit_shown",
      "description": "Use this function when a bright outfit is detected in the image with high confidence",
      "parameters": {
        "type": "object",
        "properties": {
          "outfit_color": {
            "type": "string",
            "description": "Best guess on what color of outfit it is"
          }
        },
        "required": ["outfit_color"]
      }
    }
  }
]
Please see Tool Calling for more details.

Audio Perception Configuration (Raven-1)

The following fields are available when using raven-1 and enable custom audio-based perception capabilities.

6. audio_awareness_queries

An array of custom queries that Raven-1 continuously monitors in the audio stream. Use these to track specific audio patterns or user states.
"audio_awareness_queries": [
  "Does the user sound frustrated or confused?",
  "Is the user speaking quickly as if in a hurry?"
]
Queries that Raven-1 evaluates continuously on the audio stream during the call. The answers are passed to the LLM as context so the replica can respond to tone and delivery. You can read the ongoing audio analysis for each user utterance in the Utterance event as user_audio_analysis. (There is no separate end-of-call summary for audio.)
When to use: When you want the replica to react to how the user sounds (e.g. frustrated, confused, in a hurry).
Example:
"audio_awareness_queries": [
  "Does the user sound frustrated or confused?",
  "Is the user speaking quickly as if in a hurry?"
]

7. audio_tool_prompt

A prompt that tells Raven-1 when and how to trigger tools based on what it hears (beyond the automatic emotion analysis).
"audio_tool_prompt":
  "You have a tool to escalate to a human agent when the user sounds very frustrated, named `escalate_to_human`. Use this tool when detecting sustained frustration."

8. audio_tools

Defines callable functions that Raven-1 can trigger based on audio analysis. Each tool must include a type and a function object detailing its schema.
"audio_tools": [
  {
    "type": "function",
    "function": {
      "name": "escalate_to_human",
      "description": "Escalate the conversation to a human agent when user frustration is detected",
      "parameters": {
        "type": "object",
        "properties": {
          "reason": {
            "type": "string",
            "description": "The reason for escalation"
          }
        },
        "required": ["reason"]
      }
    }
  }
]

Example Configurations

This example demonstrates a persona that monitors for visual cues (bright outfits) and triggers a tool when detected.
{
  "persona_name": "Fashion Advisor",
  "system_prompt": "As a Fashion Advisor, you specialize in offering tailored fashion advice.",
  "pipeline_mode": "full",
  "context": "You're having a video conversation with a client about their outfit.",
  "default_replica_id": "r79e1c033f",
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "visual_awareness_queries": [
        "Is the user wearing a bright outfit?"
      ],
      "perception_analysis_queries": [
        "Is the user wearing multiple bright colors?",
        "On a scale of 1-100, how often was the user looking at the screen?"
      ],
      "visual_tool_prompt": "You have a tool to notify the system when a bright outfit is detected, named `notify_if_bright_outfit_shown`. You MUST use this tool when a bright outfit is detected.",
      "visual_tools": [
        {
          "type": "function",
          "function": {
            "name": "notify_if_bright_outfit_shown",
            "description": "Use this function when a bright outfit is detected in the image with high confidence",
            "parameters": {
              "type": "object",
              "properties": {
                "outfit_color": {
                  "type": "string",
                  "description": "Best guess on what color of outfit it is"
                }
              },
              "required": ["outfit_color"]
            }
          }
        }
      ]
    }
  }
}
This example demonstrates a persona that monitors user tone and escalates to a human agent when sustained frustration is detected.
{
  "persona_name": "Support Agent",
  "system_prompt": "You are a helpful customer support agent.",
  "pipeline_mode": "full",
  "context": "You're helping a customer troubleshoot an issue.",
  "default_replica_id": "r79e1c033f",
  "layers": {
    "perception": {
      "perception_model": "raven-1",
      "audio_awareness_queries": [
        "Does the user sound frustrated or confused?",
        "Is the user speaking quickly as if in a hurry?"
      ],
      "audio_tool_prompt": "You have a tool to escalate to a human agent when the user sounds very frustrated, named `escalate_to_human`. Use this tool when detecting sustained frustration.",
      "audio_tools": [
        {
          "type": "function",
          "function": {
            "name": "escalate_to_human",
            "description": "Escalate the conversation to a human agent when user frustration is detected",
            "parameters": {
              "type": "object",
              "properties": {
                "reason": {
                  "type": "string",
                  "description": "The reason for escalation"
                }
              },
              "required": ["reason"]
            }
          }
        }
      ]
    }
  }
}
Please see the Create a Persona endpoint for more details.