The LLM Layer in Tavus enables your persona to generate intelligent, context-aware responses. You can use Tavus-hosted models or connect your own OpenAI-compatible LLM.

Tavus-Hosted Models

1. model

Select one of the available models. tavus-gpt-oss is recommended as a good starting point; the table below helps you choose based on your priorities.
| Model | Speed | Intelligence | Naturalness | Best For |
| --- | --- | --- | --- | --- |
| tavus-gpt-oss | ⚡⚡⚡ | 🧠 | 💬 | Snappy, low-latency |
| tavus-gpt-4.1 (deprecated) | ⚡⚡ | 🧠🧠🧠 | 💬💬💬 | Long-context reasoning |
| tavus-gpt-4o (deprecated) | ⚡⚡ | 🧠🧠 | 💬💬 | Legacy option |
| tavus-gemini-2.5-flash | ⚡⚡ | 🧠🧠 | 💬💬💬 | Latency + logical deduction |
| tavus-claude-haiku-4.5 | ⚡⚡ | 🧠🧠 | 💬💬 | Grounded, fewer hallucinations |
| tavus-gpt-5.2 | ⚡⚡ | 🧠🧠 | 💬💬 | General use, latency less critical |
| tavus-gpt-4o-mini (deprecated) | ⚡⚡ | 🧠 | 💬💬 | Legacy option |
| tavus-gemini-3-flash | ⚡ | 🧠🧠🧠 | 💬💬💬 | Highest intelligence, lower speed |
Context Window Limit
  • Performance and intelligence are best when prompts are limited to 5,000 tokens. Expect degraded speed and instruction following in the 15,000–20,000 token range.
  • All Tavus-hosted models support up to 32,000 tokens; staying within 5,000 is recommended for optimal behavior.
Tip: 1 token β‰ˆ 4 characters, so 5,000 tokens β‰ˆ 20,000 characters (including spaces and punctuation).
"model": "tavus-gpt-oss"
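The 1-token-≈-4-characters rule of thumb above translates into a quick pre-flight check on your prompt size. A minimal sketch (a real tokenizer will give slightly different counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) // 4

system_prompt = "You provide wellness tips and encouragement." * 200
if estimate_tokens(system_prompt) > 5_000:
    print("Prompt likely exceeds the recommended 5,000-token budget")
```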

2. tools

Optionally enable tool calling by defining functions the LLM can invoke.
Please see LLM Tool Calling for more details.
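As an illustration, a tools array typically follows the OpenAI function-calling schema (an assumption here; see LLM Tool Calling for the authoritative format). The get_wellness_tip function below is hypothetical:

```json
"tools": [
  {
    "type": "function",
    "function": {
      "name": "get_wellness_tip",
      "description": "Fetch a wellness tip for a given topic.",
      "parameters": {
        "type": "object",
        "properties": {
          "topic": { "type": "string", "description": "Topic, e.g. sleep or nutrition" }
        },
        "required": ["topic"]
      }
    }
  }
]
```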

3. speculative_inference

When set to true, the LLM begins processing speech transcriptions before the user finishes speaking, improving responsiveness. This field is optional and defaults to true; set it to false to disable.
"speculative_inference": true

4. extra_body

Add parameters to customize the LLM request. For Tavus-hosted models, you can pass temperature and top_p:
"extra_body": {
  "temperature": 0.7,
  "top_p": 0.9
}
This field is optional.

Example Configuration

{
  "persona_name": "Health Coach",
  "system_prompt": "You provide wellness tips and encouragement for people pursuing a healthy lifestyle.",
  "pipeline_mode": "full",
  "default_replica_id": "rf4e9d9790f0",
  "layers": {
    "llm": {
      "model": "tavus-gpt-oss",
      "speculative_inference": true,
      "extra_body": {
        "temperature": 0.7,
        "top_p": 0.9
      }
    }
  }
}
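To send this configuration to the Create Persona API, you can build the request in a few lines. The endpoint URL and x-api-key header below are assumptions based on common Tavus usage; confirm both against the Create Persona API reference:

```python
import json
import urllib.request

# Assumed endpoint and auth header; verify against the Create Persona API docs.
TAVUS_PERSONAS_URL = "https://tavusapi.com/v2/personas"

def build_create_persona_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a Create Persona request for the example above."""
    payload = {
        "persona_name": "Health Coach",
        "system_prompt": "You provide wellness tips and encouragement for people pursuing a healthy lifestyle.",
        "pipeline_mode": "full",
        "default_replica_id": "rf4e9d9790f0",
        "layers": {
            "llm": {
                "model": "tavus-gpt-oss",
                "speculative_inference": True,
                "extra_body": {"temperature": 0.7, "top_p": 0.9},
            }
        },
    }
    return urllib.request.Request(
        TAVUS_PERSONAS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# To actually create the persona:
# with urllib.request.urlopen(build_create_persona_request("your-api-key")) as resp:
#     print(resp.read())
```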

Custom LLMs

Prerequisites

To use your own OpenAI-compatible LLM, you’ll need:
  • Model name
  • Base URL
  • API key
Ensure your LLM:
  • Is streamable (i.e., supports server-sent events)
  • Exposes the /chat/completions endpoint
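The streaming requirement means your endpoint should emit OpenAI-style SSE chunks: `data: {...}` lines carrying chat.completion.chunk objects, terminated by `data: [DONE]`. A minimal sketch of consuming such a stream (the chunk shape follows the OpenAI streaming format):

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style chat.completion.chunk SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))  # -> Hello
```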

1. model

Name of the custom model you want to use.
"model": "gpt-3.5-turbo"

2. base_url

Base URL of your LLM endpoint.
Do not include route suffixes such as /chat/completions in the base_url.
"base_url": "https://your-llm.com/api/v1"

3. api_key

API key to authenticate with your LLM provider.
"api_key": "your-api-key"
base_url and api_key are required only when using a custom model.

4. tools

Optionally enable tool calling by defining functions the LLM can invoke.
Please see LLM Tool Calling for more details.

5. speculative_inference

When set to true, the LLM begins processing speech transcriptions before the user finishes speaking, improving responsiveness. This field is optional and defaults to true; set it to false to disable.
"speculative_inference": true

6. headers

Optional headers for authenticating with your LLM.
"headers": {
  "Authorization": "Bearer your-api-key"
}
This field is optional; whether it is needed depends on your LLM provider.

7. extra_body

Add parameters to customize the LLM request. You can pass any parameters that your LLM provider supports:
"extra_body": {
  "temperature": 0.5,
  "top_p": 0.9,
  "frequency_penalty": 0.5
}
This field is optional.

8. default_query

Add default query parameters that get appended to the base URL when making requests to the /chat/completions endpoint.
"default_query": {
  "api-version": "2024-02-15-preview"
}
This field is optional. Useful for LLM providers that require query parameters for authentication or versioning.
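Tavus composes the final request URL from base_url, the /chat/completions route, and default_query. The sketch below is purely illustrative of that composition (the actual joining happens server-side):

```python
from urllib.parse import urlencode

def compose_url(base_url: str, default_query: dict) -> str:
    # Illustrative only: shows how base_url + route + default_query combine.
    return f"{base_url.rstrip('/')}/chat/completions?{urlencode(default_query)}"

url = compose_url(
    "https://your-azure-openai.openai.azure.com/openai/deployments/gpt-4o",
    {"api-version": "2024-02-15-preview"},
)
print(url)
# -> https://your-azure-openai.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-15-preview
```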

Example Configuration

{
  "persona_name": "Storyteller",
  "system_prompt": "You are a storyteller who entertains people of all ages.",
  "pipeline_mode": "full",
  "default_replica_id": "rf4e9d9790f0",
  "layers": {
    "llm": {
      "model": "gpt-4o",
      "base_url": "https://your-azure-openai.openai.azure.com/openai/deployments/gpt-4o",
      "api_key": "your-api-key",
      "speculative_inference": true,
      "default_query": {
        "api-version": "2024-02-15-preview"
      }
    }
  }
}
Refer to the Create Persona API for a full list of supported fields.

Perception

When using the raven-1 perception model with a custom LLM, your LLM will receive system messages containing visual context extracted from the user’s video input.
{
    "role": "system",
    "content": "<user_appearance>...</user_appearance> <user_emotions>...</user_emotions> <user_screenshare>...</user_screenshare>"
}
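If your LLM-side pipeline wants to inspect this visual context separately, the tagged sections can be pulled out with a small parser. A sketch using the tag names from the example above:

```python
import re

def extract_visual_context(content: str) -> dict:
    """Extract tagged visual-context sections from a perception system message."""
    tags = ("user_appearance", "user_emotions", "user_screenshare")
    out = {}
    for tag in tags:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", content, re.DOTALL)
        if m:
            out[tag] = m.group(1).strip()
    return out

msg = "<user_appearance>wearing glasses</user_appearance> <user_emotions>smiling</user_emotions>"
print(extract_visual_context(msg))
```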

Disabled Perception model

If you disable the perception model, your LLM will not receive any special messages.