Tool Calling for Perception

Perception tool calling lets the PAL trigger functions based on visual or audio cues that the perception model (Raven) detects during a conversation, in parallel with the main LLM turn. Tools are reusable objects: create them once and attach them to any number of PALs. They differ from LLM tools only in their origin value.

This page documents the tools registry (/v2/tools with origin: "vision" or "audio"). If your PAL still embeds tools under layers.perception.visual_tools or layers.perception.audio_tools, see Legacy inline tool calling.

Perception tool calling is only available with Raven (perception_model: "raven-1" on the PAL’s perception layer).

How Perception Tools Work

Perception runs as a parallel step alongside the conversational LLM. Raven analyses the audio and video streams continuously and fires a tool the moment it detects something matching one of the tool descriptions you defined. There are two flavors, picked via the tool’s origin:

Vision tools (origin: "vision") - triggered by what Raven sees in the video stream (e.g. an ID card, a bright outfit, a hat).
Audio tools (origin: "audio") - triggered by what Raven hears in the audio stream (e.g. sarcasm, sustained frustration).

Because perception runs in parallel, the PAL keeps speaking and listening normally while a perception tool dispatches. Perception tools are fire-and-forget: on the conversational side, the PAL does not pause, play filler, or react to the result.

Defining a Perception Tool

The name, description, parameters, and delivery fields work the same way they do for LLM tools - see Tool Calling for LLM for the full reference.

Field	Type	Required	Description
`name`	string	✅	Unique identifier, scoped to your account. Must match `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$`.
`description`	string	✅	What Raven should look or listen for. Be specific - this is what triggers the tool.
`parameters`	object	❌	JSON Schema for the arguments Raven extracts when the cue is detected.
`origin`	string	✅	`"vision"` or `"audio"`.
`delivery`	object	❌	Defaults to `{"app_message": true}`. Delivery to an API endpoint is also supported (same shape as LLM tools).

You do not need to set on_call, on_resolve, or static_filler on a perception tool. Omit them and the API applies the only allowed values (null, "fire_and_forget", null respectively). Passing any other value returns a 400.
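The rules above can be checked client-side before a tool is created. The following is a minimal sketch, not part of the API: validate_perception_tool is a hypothetical helper, and it only mirrors the constraints listed on this page (the name pattern, the two origin values, and the fixed on_call, on_resolve, and static_filler values). The API may enforce additional rules that this sketch does not cover.

```python
import re

# Name pattern quoted from the field table above.
NAME_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
ALLOWED_ORIGINS = {"vision", "audio"}
# The only values accepted for these fields on a perception tool.
FIXED_FIELDS = {"on_call": None, "on_resolve": "fire_and_forget", "static_filler": None}


def validate_perception_tool(tool: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty means it looks valid)."""
    problems = []

    name = tool.get("name")
    if not isinstance(name, str) or not NAME_PATTERN.fullmatch(name):
        problems.append("name must match ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")

    description = tool.get("description")
    if not isinstance(description, str) or not description.strip():
        problems.append("description is required")

    origin = tool.get("origin")
    if not isinstance(origin, str) or origin not in ALLOWED_ORIGINS:
        problems.append('origin must be "vision" or "audio"')

    # These fields may be omitted; if present, they must hold the fixed value.
    for field, allowed in FIXED_FIELDS.items():
        if field in tool and tool[field] != allowed:
            problems.append(f"{field} must be omitted or set to {allowed!r}")

    return problems
```

Running this check before the POST request surfaces the same mistakes that would otherwise come back as a 400, for example an on_resolve value other than "fire_and_forget".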

Vision Tool Example

Create a vision tool

curl --request POST \
  --url https://tavusapi.com/v2/tools \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '{
    "name": "notify_if_id_shown",
    "description": "Trigger when a driver'\''s license or passport is clearly visible in the video stream with high confidence.",
    "parameters": {
      "type": "object",
      "properties": {
        "id_type": {
          "type": "string",
          "description": "Best guess on what type of ID it is"
        }
      },
      "required": ["id_type"]
    },
    "origin": "vision"
  }'

When Raven detects an ID in frame, your application receives a conversation.perception_tool_call event with modality: "vision", the name, structured arguments, and a frames array of base64-encoded images that triggered the call.
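On the application side, the frames need to be decoded before they can be saved or passed to another service. The sketch below shows one way to do that, under stated assumptions: the key names modality, name, arguments, and frames are taken from the description above, but their exact position in the event payload is an assumption, as is the idea that each frame is a plain base64 string. extract_vision_call is a hypothetical helper.

```python
import base64


def extract_vision_call(event: dict) -> tuple[str, dict, list[bytes]]:
    """Hypothetical helper: return (tool name, arguments, decoded frames)."""
    # ASSUMPTION: these keys sit at the top level of the dict passed in.
    # Adjust the lookups if your payload nests them under another key.
    if event.get("modality") != "vision":
        raise ValueError("not a vision perception tool call")

    name = event["name"]
    arguments = event.get("arguments", {})

    # ASSUMPTION: each frame is a plain base64 string (no data-URL prefix).
    frames = [base64.b64decode(frame) for frame in event.get("frames", [])]
    return name, arguments, frames
```

For the tool defined above, name would be "notify_if_id_shown" and arguments would carry the id_type value Raven extracted.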

Audio Tool Example

Create an audio tool

curl --request POST \
  --url https://tavusapi.com/v2/tools \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '{
    "name": "notify_sarcasm_detected",
    "description": "Trigger when the user'\''s tone or phrasing suggests sarcasm.",
    "parameters": {
      "type": "object",
      "properties": {
        "reason": {
          "type": "string",
          "description": "Why you detected sarcasm (e.g. what the user said)"
        }
      },
      "required": ["reason"]
    },
    "origin": "audio"
  }'

When Raven hears the cue, your application receives a conversation.perception_tool_call event with modality: "audio" and the structured arguments.
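When a PAL carries several perception tools, the application usually needs to route each event to the right piece of code. A minimal sketch of such routing follows, keyed on modality and tool name. The registry, the on_perception_tool decorator, and dispatch are all hypothetical names, and the flat event shape (modality, name, arguments at the top level) is an assumption. The sketch also assumes the caller has already filtered for conversation.perception_tool_call events.

```python
from typing import Callable

Handler = Callable[[dict], None]
# Hypothetical registry: (modality, tool name) -> handler function.
HANDLERS: dict[tuple[str, str], Handler] = {}


def on_perception_tool(modality: str, name: str) -> Callable[[Handler], Handler]:
    """Register a handler for one (modality, tool name) pair."""
    def register(handler: Handler) -> Handler:
        HANDLERS[(modality, name)] = handler
        return handler
    return register


def dispatch(event: dict) -> bool:
    """Route an event to its handler; return True if a handler ran."""
    # ASSUMPTION: modality, name, and arguments are top-level keys.
    handler = HANDLERS.get((event.get("modality"), event.get("name")))
    if handler is None:
        return False
    handler(event.get("arguments", {}))
    return True


sarcasm_log: list[str] = []


@on_perception_tool("audio", "notify_sarcasm_detected")
def record_sarcasm(arguments: dict) -> None:
    # Store the reason Raven extracted for later review.
    sarcasm_log.append(arguments.get("reason", ""))
```

Because perception tools are fire-and-forget, handlers like record_sarcasm return nothing: there is no result for the conversation to consume.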

Attaching to a PAL

Perception tools are attached the same way as LLM tools:

Attach perception tools

curl --request POST \
  --url https://tavusapi.com/v2/pals/{pal_id}/tools \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '{
    "tool_ids": ["tabc123def456"]
  }'

The same PAL can hold both LLM and perception tools. Make sure the PAL’s perception layer has perception_model: "raven-1" for vision and audio tools to fire.
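The attach call above can also be built from application code. The sketch below constructs the same request with Python's standard library and does not send it. build_attach_request is a hypothetical helper; the URL, headers, and tool_ids body are copied from the curl example, and passing more than one ID in the array is an assumption based on the field being a list.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://tavusapi.com/v2"


def build_attach_request(pal_id: str, tool_ids: list[str], api_key: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the attach-tools POST."""
    body = json.dumps({"tool_ids": tool_ids}).encode("utf-8")
    # Percent-encode the ID so it is safe to place in the URL path.
    path_id = urllib.parse.quote(pal_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/pals/{path_id}/tools",
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and handle urllib.error.HTTPError for non-2xx responses.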

Delivery

Perception tools use the same delivery field as LLM tools - see Tool Delivery and Tool Authentication. The only perception-specific difference is the name of the app-message event: conversation.perception_tool_call (not conversation.tool_call).

Because perception tools are fire-and-forget, the response body your API returns is not consumed by the conversational LLM. A 2xx is enough to acknowledge receipt; a non-2xx is logged but does not affect the conversation.
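Since only the status code matters, a receiving endpoint can be very small. The following is a minimal sketch using Python's built-in http.server, suitable for local experiments rather than production. It assumes the call arrives as a POST with a JSON body; the exact request shape is defined by the delivery configuration and is not shown on this page. PerceptionToolHandler, received_calls, and make_server are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Calls received so far, kept in memory for illustration only.
received_calls: list = []


class PerceptionToolHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # ASSUMPTION: the tool call is delivered as a POST with a JSON body.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            received_calls.append(json.loads(raw or b"{}"))
        except json.JSONDecodeError:
            # A non-2xx is logged but does not affect the conversation.
            self.send_response(400)
            self.end_headers()
            return
        # Any 2xx acknowledges receipt; the body would not be consumed.
        self.send_response(204)
        self.end_headers()

    def log_message(self, format, *args) -> None:
        pass  # Silence the default per-request stderr logging.


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), PerceptionToolHandler)
```

Calling make_server(8080).serve_forever() starts the listener. Any slow work triggered by a call is best handed to a background job so the acknowledgement is returned promptly.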

Replace <api-key> with your actual API key. You can generate one in the PAL Maker.
