Raven-0 is a real-time multimodal vision and video understanding system that fundamentally reimagines how AI perceives and interacts with humans. Unlike traditional systems that rely on frame-by-frame analysis, Raven-0 implements a context-aware, human-like perception system modeled on the functioning of the primary visual cortex.

We recommend using as much of the CVI end-to-end pipeline as possible to guarantee the lowest latency and the best experience for your customers.

Key Capabilities

Raven-0 provides advanced perception capabilities that go far beyond traditional vision systems.

How Raven Works

Raven-0 implements a dual-track vision processing system that mirrors human perception:

Ambient Perception

Ambient perception acts as the replica’s “eyes,” continuously processing and understanding the visual environment at a low level. This provides ambient context that informs the replica’s responses without requiring explicit queries.

  • Default Queries: Raven automatically processes visual information to understand who the user is, what they look like, their emotional state, and other contextual information.
  • Custom Queries: You can define custom visual queries that Raven will continuously monitor for, allowing for specialized use cases (see the sketch below).
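As an illustration, ambient queries work best as short, checkable questions about what the camera might show. A minimal sketch of a perception-layer configuration in Python follows; the query strings are illustrative examples, not built-in defaults.

# Sketch of a perception layer with custom ambient queries. The query
# strings are illustrative; phrase each one as a short, checkable
# question about what the camera might show.
perception_layer = {
    "perception_model": "raven-0",
    "ambient_awareness_queries": [
        "Is the user visibly confused or frustrated?",
        "Is there more than one person in frame?",
    ],
}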

Active Perception

When specific visual information is needed, Raven can perform detailed on-demand analysis:

  • Speculative Execution: Raven uses speculative execution to pre-process likely visual queries while the user is speaking, minimizing perceived latency.

Screenshare Vision

Raven processes screen content with higher detail retention, capturing animations, dynamic content, and page transitions. You can share your calendar, documents, and other content with your replica, and switch between screens seamlessly.

End-of-call Perception Analysis

At the end of a call, Raven summarizes the visual artifacts detected throughout the call. This feature is only available when the persona has raven-0 specified in the perception layer; the summary is broadcast as a Perception Analysis event and is also delivered separately as a conversation callback.
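If you consume these events over a webhook, a handler might look like the sketch below. This assumes a Flask app receiving JSON callbacks; the event-type string and payload field names are illustrative assumptions, not a documented schema.

# Minimal sketch of a webhook receiver for conversation callbacks,
# assuming Flask and JSON payloads. The event-type value and the
# payload field names below are illustrative assumptions.
from flask import Flask, request

app = Flask(__name__)

@app.route("/tavus-callback", methods=["POST"])
def handle_callback():
    event = request.get_json(force=True)
    # Hypothetical event-type value for the end-of-call analysis.
    if event.get("event_type") == "application.perception_analysis":
        summary = event.get("properties", {}).get("analysis")
        print("End-of-call visual summary:", summary)
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)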


Configuring Raven

You can configure Raven’s behavior through the Create Persona API by adjusting the perception parameters.
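For example, a persona with Raven-0 enabled might be created as in the sketch below, which uses Python's requests library. The https://tavusapi.com/v2/personas endpoint and x-api-key header are assumed from the public API convention; verify them against the Create Persona reference.

# Sketch of creating a persona with the Raven-0 perception layer via
# the Create Persona API. The endpoint URL and x-api-key header are
# assumptions; check the current API reference.
import requests

payload = {
    "persona_name": "ID Verification Agent",  # illustrative name
    "layers": {
        "perception": {
            "perception_model": "raven-0",
            "ambient_awareness_queries": [
                "Is the user showing an ID card?",
            ],
        }
    },
}

response = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "<your-api-key>"},
    json=payload,
)
response.raise_for_status()
print(response.json())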

Perception Parameters

layers.perception.perception_model
string

The perception model to use. Options are raven-0 for advanced multimodal perception, basic for simpler vision capabilities, or off to disable perception entirely.

layers.perception.ambient_awareness_queries
array

Custom queries that Raven continuously monitors for in the visual stream, providing ambient context without explicit prompting and making the replica aware of these additional visual cues.

layers.perception.perception_tool_prompt
string

A prompt that details how and when to use the tools passed to the perception layer. This helps the replica understand when the perception tools apply and grounds their use in the conversation context.

layers.perception.perception_tools
array

Tools that can be triggered based on visual context, enabling automated actions in response to visual cues from your system.

Example Configuration

{
  "layers": {
    "perception": {
      "perception_model": "raven-0",
      "ambient_awareness_queries": [
        "Is the user showing an ID card?",
        "Is the user wearing a mask?"
      ],
      "perception_tool_prompt": "You have a tool to notify the system when an ID card is detected, named `notify_if_id_shown`.",
      "perception_tools": [
        {
          "name": "notify_if_id_shown",
          "description": "Notify the system when an ID card is detected"
        }
      ]
    }
  }
}
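When Raven triggers a perception tool such as notify_if_id_shown above, your application receives the call as a conversation event. The sketch below dispatches on such an event; the event-type string and payload shape are illustrative assumptions, so match them to the schema you actually receive.

# Sketch of dispatching a perception tool call received from the
# conversation event stream. The event shape (event_type,
# properties.name, properties.arguments) is an illustrative
# assumption, not a documented schema.
def on_app_message(message: dict) -> None:
    if message.get("event_type") != "conversation.perception_tool_call":
        return  # assumed event-type string
    props = message.get("properties", {})
    if props.get("name") == "notify_if_id_shown":
        args = props.get("arguments") or {}
        print("ID card detected; notifying verification service:", args)

# Example payload, mirroring the tool defined in the configuration above.
on_app_message({
    "event_type": "conversation.perception_tool_call",
    "properties": {"name": "notify_if_id_shown", "arguments": {}},
})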