Custom LLM Onboarding
You can integrate an OpenAI-compatible LLM to replace our existing options (tavus-llama, tavus-gpt-4o, tavus-gpt-4o-mini).
Create Persona
To get started, you’ll need to create a Persona that specifies your custom LLM. Here’s an example Persona:
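The sketch below creates such a persona in Python. It is illustrative only: the model, base_url, and api_key values are placeholders for your own OpenAI-compatible deployment, and the field names should be checked against the current personas API reference.

```python
# Minimal sketch: create a persona whose LLM layer points at your own
# OpenAI-compatible endpoint. All values are placeholders.
import requests

payload = {
    "persona_name": "Storyteller",
    "system_prompt": "You are a storyteller. You like telling stories to people of all ages.",
    "layers": {
        "llm": {
            "model": "your-model-name",                      # the model your endpoint serves
            "base_url": "https://your-llm.example.com/v1",   # /chat/completions is appended to this
            "api_key": "your-llm-api-key",                   # used to authenticate requests to your server
        }
    },
}

resp = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "your-tavus-api-key", "Content-Type": "application/json"},
    json=payload,
)
print(resp.json())
```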
The response includes the newly created persona’s ID (for example, p234324a).
Launch a Conversation
With this persona, we can launch a conversation.
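A minimal sketch of launching that conversation is shown below; the persona_id and replica_id values are placeholders, and the exact request shape should be checked against the conversations API reference.

```python
# Minimal sketch: launch a conversation that uses the custom-LLM persona.
import requests

resp = requests.post(
    "https://tavusapi.com/v2/conversations",
    headers={"x-api-key": "your-tavus-api-key", "Content-Type": "application/json"},
    json={
        "persona_id": "p234324a",         # the persona created above
        "replica_id": "your-replica-id",  # a replica you have access to
    },
)
print(resp.json())  # the response includes the details needed to join the conversation
```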
We will see user utterances coming into the endpoint you provided (with the /chat/completions suffix appended) as the user speaks during the conversation.
If you set up a test webhook and set the base_url to point to that webhook’s URL, you can examine an incoming chat completion request. You may notice the conversation_id is provided as a request header, and your API key can be used to authenticate requests coming into your servers.
We make the chat completion request to the URL you provide with streaming enabled, which means your OpenAI-compatible LLM should be configured to stream (i.e. send back chunks of chat completions over SSE (Server-Sent Events)). The OpenAI documentation on chat completions is a quick reference point for what to return in the response.
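For illustration, here is a minimal sketch of an OpenAI-compatible streaming endpoint built with FastAPI. It is not a Tavus-provided server; the hard-coded tokens stand in for real model output, and you would plug your own generation logic into the loop.

```python
# Minimal sketch of an OpenAI-compatible /chat/completions endpoint that
# streams chat completion chunks over SSE.
import json
import time
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/completions")
async def chat_completions(request: Request):
    # The incoming headers include the conversation ID; inspect request.headers
    # on a test request to see the exact header name.
    body = await request.json()  # standard chat-completions payload: model, messages, stream, ...

    def sse_chunks():
        completion_id = f"chatcmpl-{uuid.uuid4().hex}"
        # Replace this loop with real token generation from your model.
        for token in ["Hello", ",", " world", "!"]:
            chunk = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": body.get("model", "custom-llm"),
                "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse_chunks(), media_type="text/event-stream")
```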
Speculative Inference
The speculative_inference parameter activates speculative inference, a technique that can significantly reduce response times in speech-to-text and natural language processing applications. This can be configured in the Persona.
Overview of Speculative Inference
Speculative inference is an advanced processing technique that allows AI systems to begin generating responses before all input data is available. In the context of speech recognition and natural language processing, this means the LLM can start working from partial transcriptions while the user is still speaking.
Behavior
When speculative_inference is set to true:
The replica will not start to speak until it is confident the user is done speaking; in the meantime, progressive transcriptions are sent to the LLM layer, each one accumulating the prior transcriptions, until the replica starts speaking.
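For illustration only, the requests your endpoint receives while a user says “Tell me a story about dragons” might look something like the following; the exact chunking and timing depend on the transcription layer.

```python
# Hypothetical sequence of progressive transcriptions, each accumulating
# the prior ones, sent to /chat/completions while the user is still speaking.
progressive_requests = [
    {"messages": [{"role": "user", "content": "Tell me"}]},
    {"messages": [{"role": "user", "content": "Tell me a story"}]},
    {"messages": [{"role": "user", "content": "Tell me a story about dragons"}]},
]
```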
Benefits
- Significantly faster response times
- Improved user experience due to reduced latency
- More natural, conversational interaction
Create a Persona with Speculative Inference
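A minimal sketch, assuming speculative_inference sits alongside the other LLM-layer fields (check the personas API reference for the exact placement):

```python
# Minimal sketch: create a custom-LLM persona with speculative inference enabled.
import requests

payload = {
    "persona_name": "Storyteller",
    "system_prompt": "You are a storyteller. You like telling stories to people of all ages.",
    "layers": {
        "llm": {
            "model": "your-model-name",
            "base_url": "https://your-llm.example.com/v1",
            "api_key": "your-llm-api-key",
            "speculative_inference": True,
        }
    },
}

resp = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "your-tavus-api-key", "Content-Type": "application/json"},
    json=payload,
)
print(resp.json())
```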
The response includes the newly created persona’s ID (for example, p234324a).
Tools / Function Calling
You can pass in tools (function calls) to your LLM to enable it to perform tasks beyond just text generation. This is useful if you want to integrate external APIs or services into your LLM. Please note that tools are only available for custom LLMs, and require an intermediate layer to be built on your end to handle the tool calls. Currently, we do not run the tools for you.
Here’s a full example of a persona that includes a tool to get the current weather for a given location:
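The sketch below uses the standard OpenAI chat-completions tool schema for a get_current_weather function; placing the tool list under the LLM layer is an assumption to verify against the personas API reference.

```python
# Minimal sketch: a custom-LLM persona with one tool definition.
import requests

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}

payload = {
    "persona_name": "Weather Assistant",
    "system_prompt": "You are a helpful assistant who can look up the current weather.",
    "layers": {
        "llm": {
            "model": "your-model-name",
            "base_url": "https://your-llm.example.com/v1",
            "api_key": "your-llm-api-key",
            "tools": [weather_tool],
        }
    },
}

resp = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "your-tavus-api-key", "Content-Type": "application/json"},
    json=payload,
)
print(resp.json())
```

When your model emits a tool call, the intermediate layer you build is responsible for executing get_current_weather and feeding the result back to the model.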
LLM Abstractions
We have abstracted the system such that the LLM receives three distinct “sub-instructions” that are concatenated together. Let’s use storytelling as an example persona.
If my goal is to create a storyteller, I can do so with the combination of system_prompt (Persona), context (Persona) and conversational_context (Conversation).
- system_prompt (Persona) can be something along the lines of: “You are a storyteller. You like telling stories to people of all ages.” This defines what a storyteller is.
- context (Persona) is for what that storyteller focuses on: “Here are some of your favorite stories to tell: Little Red Riding Hood, The Ugly Duckling and The Three Little Pigs.” This defines what a storyteller has.
- conversational_context (Conversation) is for all the details that revolve around that specific interaction between the user and replica, something like: “You are talking to {user_name} (you may pass that in dynamically per conversation request). They are {x} years old. They like listening to {genre} stories.” This defines who the storyteller is talking to.
This allows you to create as many conversations as you want using the storyteller persona without sharing conversation-specific context, while also allowing you to create default system prompts on your end and create personas of varying contexts (crime-novel storyteller, horror storyteller, children’s storyteller, etc.).
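Putting the storyteller example together as a sketch, with the same assumed endpoints and field names as above and placeholder user details:

```python
# Sketch: persona-level instructions (system_prompt, context) plus
# conversation-level instructions (conversational_context).
import requests

HEADERS = {"x-api-key": "your-tavus-api-key", "Content-Type": "application/json"}

persona = requests.post(
    "https://tavusapi.com/v2/personas",
    headers=HEADERS,
    json={
        "persona_name": "Storyteller",
        "system_prompt": "You are a storyteller. You like telling stories to people of all ages.",
        "context": (
            "Here are some of your favorite stories to tell: Little Red Riding Hood, "
            "The Ugly Duckling and The Three Little Pigs."
        ),
    },
).json()

conversation = requests.post(
    "https://tavusapi.com/v2/conversations",
    headers=HEADERS,
    json={
        "persona_id": persona["persona_id"],  # assumed name of the ID field in the response
        "replica_id": "your-replica-id",
        "conversational_context": (
            "You are talking to Ada. They are 7 years old. They like listening to adventure stories."
        ),
    },
).json()
```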
This combination would populate the initial system_prompt of the chat completion request we send your way, and since we send the entire context each time, anything you have in the system_prompt persists. You may also completely parse the incoming request body and choose what to send to your LLM, building your own abstraction in place of what we currently offer.
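As a sketch of that kind of abstraction, the hypothetical helper below (not part of any Tavus SDK) drops the incoming system message and substitutes your own before you forward the messages to your model:

```python
# Hypothetical helper: replace the concatenated system prompt we send with
# your own abstraction before forwarding the request to your model.
def rewrite_messages(body: dict) -> list[dict]:
    messages = body.get("messages", [])
    non_system = [m for m in messages if m.get("role") != "system"]
    return [{"role": "system", "content": "Your own system prompt here."}] + non_system
```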