# Knowledge Base

> Upload documents to your knowledge base for personas to reference during conversations.

<Note>
  For now, our Knowledge Base only supports documents written in English and works best for conversations in English.

  We'll be expanding our Knowledge Base language support soon!
</Note>

Our Knowledge Base uses Retrieval-Augmented Generation (RAG) to process the contents of your documents and websites so that your personas can retrieve and use that information naturally during a conversation.

During a conversation, the persona continuously analyzes what is being said and pulls relevant information, as added context, from the documents you selected when creating the conversation.

## Getting Started With Your Knowledge Base

To use the Knowledge Base, upload the documents or website URLs that you want your personas to reference in conversations.
Let's walk through how to upload your documents and use them in a conversation.

<Note>
  You can either use our [Developer Portal](https://platform.tavus.io/documents) or API endpoints to upload and manage your documents.
  Our Knowledge Base supports creating documents from an uploaded file or a website URL.
</Note>

<Steps>
  <Step title="Step 1: Ensure Website Resources are Publicly Accessible" titleSize="h3">
    For any document created from a website URL, make sure the URL is publicly accessible without requiring authorization. A pre-signed S3 link, for example, meets this requirement.

    For example, entering the URL in a browser should either:

    * Open the website you want to process and save contents from.
    * Open a document in a PDF viewer.
    * Download the document.
  </Step>

  <Step title="Step 2: Upload your Documents" titleSize="h3">
    You can create documents using either the [Developer Portal](https://platform.tavus.io/documents) or the [Create Document](https://docs.tavus.io/api-reference/documents/create-document) API endpoint.

    To use the API, send a `POST` request to `https://tavusapi.com/v2/documents`. The `callback_url` field is optional; include it to receive progress updates.

    Here's an example request body:

    ```json
    {
        "document_name": "test-doc-1",
        "document_url": "https://your.document.pdf",
        "callback_url": "webhook-url-to-get-progress-updates"
    }
    ```

    The response from this POST request will include a `document_id` - a unique identifier for your uploaded document. When creating a conversation, you may include all `document_id` values that you would like the persona to have access to.

    Currently, we support the following file formats: .pdf, .txt, .docx, .doc, .png, .jpg, .pptx, .csv, and .xlsx.
  </Step>

  <Step title="Step 3: Document Processing" titleSize="h3">
    After your document is uploaded, it is processed automatically in the background so that it can be retrieved quickly during conversations.
    Processing can take 5-10 minutes depending on document size.

    During processing, if you have provided a `callback_url` in the [Create Document](https://docs.tavus.io/api-reference/documents/create-document) request body, you will receive periodic callbacks with status updates.
    You may also use the [Get Document](https://docs.tavus.io/api-reference/documents/get-document) endpoint to poll the most recent status of your documents.
  </Step>

  <Step title="Step 4: Create a conversation with the document" titleSize="h3">
    Once your documents have finished processing, you may use the `document_id` from Step 2 as part of the [Create Conversation](https://docs.tavus.io/api-reference/conversations/create-conversation) request.

    You can add multiple documents to a conversation by listing them in the `document_ids` array.

    ```json
    {
      "persona_id": "your_persona_id",
      "replica_id": "your_replica_id",
      "document_ids": ["d1234567890", "d1234567891"]
    }
    ```

    During your conversation, the persona will be able to reference information from your documents in real time.
  </Step>
</Steps>
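
The upload and polling steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than an official client. The `x-api-key` header name, the helper names, and the idea that Get Document returns a single status string are assumptions, so confirm them against the API reference. The `ready` and `error` states are the ones named under Recrawl Requirements.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

API_BASE = "https://tavusapi.com/v2"


def build_create_document_request(
    api_key: str,
    name: str,
    url: str,
    callback_url: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST /v2/documents request."""
    payload = {"document_name": name, "document_url": url}
    if callback_url is not None:
        payload["callback_url"] = callback_url  # optional progress webhook
    return urllib.request.Request(
        f"{API_BASE}/documents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed header name, check the API reference
        },
        method="POST",
    )


def wait_until_processed(
    fetch_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document is `ready` or `error`, or raise on timeout.

    `fetch_status` is a hypothetical callable that wraps the Get Document
    endpoint and returns the document's current status string.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in ("ready", "error"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"document still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

To send the request, pass it to `urllib.request.urlopen` and read `document_id` from the JSON response. The 15-minute default timeout is a guess that leaves headroom over the 5-10 minute processing window. If you supply a `callback_url`, you can skip polling entirely.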

## Retrieval Strategy

When creating a conversation with documents, you can optimize how the system searches through your knowledge base by specifying a retrieval strategy. This strategy determines the balance between search speed and the quality of retrieved information, allowing you to fine-tune the system based on your specific needs.

You can choose from three different strategies:

* `speed`: Optimizes for the fastest retrieval and minimal latency.
* `balanced`: Provides a balance between retrieval speed and quality.
* `quality` (default): Prioritizes finding the most relevant information, which may take slightly longer but can provide more accurate responses.

```json
{
  "persona_id": "your_persona_id",
  "replica_id": "your_replica_id",
  "document_ids": ["d1234567890"],
  "document_retrieval_strategy": "balanced"
}
```
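
If you assemble the Create Conversation body in code, it can help to reject an unknown strategy before the request is sent. The sketch below is one way to do that in Python. The function name is hypothetical, and only the field names and the three strategy values come from this page.

```python
from typing import Optional, Sequence

# The three strategies listed above; `quality` is the documented default.
RETRIEVAL_STRATEGIES = ("speed", "balanced", "quality")


def build_conversation_payload(
    persona_id: str,
    replica_id: str,
    document_ids: Sequence[str],
    strategy: Optional[str] = None,
) -> dict:
    """Return a Create Conversation body.

    Leaving `strategy` as None omits the field so the API default applies.
    """
    if strategy is not None and strategy not in RETRIEVAL_STRATEGIES:
        raise ValueError(f"unknown retrieval strategy: {strategy!r}")
    payload = {
        "persona_id": persona_id,
        "replica_id": replica_id,
        "document_ids": list(document_ids),
    }
    if strategy is not None:
        payload["document_retrieval_strategy"] = strategy
    return payload
```

Omitting the field rather than hard-coding `quality` keeps the payload aligned with whatever default the API applies.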

## Document Tags

If you have a lot of documents, maintaining long lists of `document_id` values can get tricky.
Instead of using distinct `document_ids`, you can also group documents together with shared tag values.
During the [Create Document](https://docs.tavus.io/api-reference/documents/create-document) API call, you may specify a value for `tags` for your document.
Then, when you create a conversation, you may specify the `tags` value instead of passing in discrete `document_id` values.

For example, if you are uploading course material, you could add the tag `"lesson-1"` to all documents that you want accessible in the first lesson.

```json
{
  "document_name": "test-doc-1",
  "document_url": "https://your.document.pdf",
  "tags": ["lesson-1"]
}
```

In the [Create Conversation](https://docs.tavus.io/api-reference/conversations/create-conversation) request, you can add the tag value `lesson-1` to `document_tags` instead of individual `document_id` values.

```json
{
  "persona_id": "your_persona_id",
  "replica_id": "your_replica_id",
  "document_tags": ["lesson-1"]
}
```
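
As a rough sketch of the tagging workflow, the Python helpers below build one Create Document body per file, all sharing a tag, and then a Create Conversation body that references the tag. The helper names and the example URLs are placeholders invented for illustration. Only the field names come from the examples above.

```python
from typing import Dict, List, Sequence


def tagged_document_payloads(urls_by_name: Dict[str, str], tag: str) -> List[dict]:
    """Build one Create Document body per (name, url) pair, all sharing `tag`."""
    return [
        {"document_name": name, "document_url": url, "tags": [tag]}
        for name, url in urls_by_name.items()
    ]


def conversation_payload_for_tags(
    persona_id: str, replica_id: str, tags: Sequence[str]
) -> dict:
    """Build a Create Conversation body that selects documents by tag."""
    return {
        "persona_id": persona_id,
        "replica_id": replica_id,
        "document_tags": list(tags),
    }


# Placeholder course material for the first lesson.
lesson_1_docs = tagged_document_payloads(
    {
        "syllabus": "https://example.com/syllabus.pdf",
        "reading-1": "https://example.com/reading-1.pdf",
    },
    tag="lesson-1",
)
conversation = conversation_payload_for_tags(
    "your_persona_id", "your_replica_id", ["lesson-1"]
)
```

With this approach, adding a document to the lesson later only requires uploading it with the same tag. The conversation payload does not change.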

## Website Crawling

When adding a website to your knowledge base, you have two options:

### Single Page Scraping (Default)

By default, when you provide a website URL, only that single page is scraped and processed. This is ideal for:

* Landing pages with concentrated information
* Specific articles or blog posts
* Individual product pages

### Multi-Page Crawling

For comprehensive coverage of a website, you can enable **crawling** by providing a `crawl` configuration. This tells the system to start at your URL and follow links to discover and process additional pages.

```json
{
  "document_name": "Company Docs",
  "document_url": "https://docs.example.com/",
  "crawl": {
    "depth": 2,
    "max_pages": 25
  }
}
```

#### Crawl Parameters

| Parameter   | Range | Description                                                                                                                                                                 |
| ----------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `depth`     | 1-10  | How many link levels to follow from the starting URL. A depth of 1 crawls pages directly linked from your starting URL; depth of 2 follows links on those pages, and so on. |
| `max_pages` | 1-100 | Maximum number of pages to process. Crawling stops when this limit is reached.                                                                                              |
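
A client-side check against the ranges in the table can catch a bad configuration before the request is sent. The following Python sketch assumes both bounds are inclusive, and the function name is hypothetical.

```python
def build_crawl_config(depth: int, max_pages: int) -> dict:
    """Return a `crawl` object after checking the documented ranges.

    Assumes the ranges in the table (1-10 and 1-100) are inclusive.
    """
    if not (isinstance(depth, int) and 1 <= depth <= 10):
        raise ValueError(f"depth must be an integer from 1 to 10, got {depth!r}")
    if not (isinstance(max_pages, int) and 1 <= max_pages <= 100):
        raise ValueError(
            f"max_pages must be an integer from 1 to 100, got {max_pages!r}"
        )
    return {"depth": depth, "max_pages": max_pages}


# Same values as the example request above.
document_body = {
    "document_name": "Company Docs",
    "document_url": "https://docs.example.com/",
    "crawl": build_crawl_config(depth=2, max_pages=25),
}
```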

#### Crawl Limits

To ensure fair usage and system stability:

* Maximum **100 crawl documents** per account
* Maximum **5 concurrent crawls** at any time
* **1-hour cooldown** between recrawls of the same document

## Keeping Content Fresh

Website content changes over time, and you may need to update your knowledge base to reflect those changes. For documents created with crawl configuration, you can trigger a **recrawl** to fetch fresh content.

### Using the Recrawl Endpoint

Send a POST request to recrawl an existing document:

```bash
POST https://tavusapi.com/v2/documents/{document_id}/recrawl
```

The recrawl will:

1. Use the same starting URL and crawl configuration
2. Replace old content with the new content
3. Update `last_crawled_at` and increment `crawl_count`

### Optionally Override Crawl Settings

You can provide new crawl settings when triggering a recrawl:

```json
{
  "crawl": {
    "depth": 3,
    "max_pages": 50
  }
}
```

### Recrawl Requirements

* Document must be in `ready` or `error` state
* At least 1 hour must have passed since the last crawl
* Document must have been created with crawl configuration
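
The requirements above can be checked locally before calling the endpoint. The Python sketch below is illustrative only. The function and argument names are hypothetical, and it assumes you already have the document's state, its `last_crawled_at` timestamp as a timezone-aware `datetime`, and whether it was created with a crawl configuration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

RECRAWL_COOLDOWN = timedelta(hours=1)


def recrawl_blockers(
    state: str,
    has_crawl_config: bool,
    last_crawled_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the reasons a recrawl would be rejected (empty list if none)."""
    now = now or datetime.now(timezone.utc)
    blockers = []
    if state not in ("ready", "error"):
        blockers.append(f"document is in state '{state}', not 'ready' or 'error'")
    if not has_crawl_config:
        blockers.append("document was not created with a crawl configuration")
    if last_crawled_at is not None and now - last_crawled_at < RECRAWL_COOLDOWN:
        remaining = RECRAWL_COOLDOWN - (now - last_crawled_at)
        minutes = int(remaining.total_seconds() // 60) + 1
        blockers.append(f"cooldown active, retry in about {minutes} min")
    return blockers
```

Returning a list of reasons rather than a single boolean makes it easier to show a useful message when a recrawl is not yet possible.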

See the [Recrawl Document API reference](https://docs.tavus.io/api-reference/documents/recrawl-document) for complete details.

## Best Practices for Documents

Following these guidelines will help your persona deliver accurate, consistent answers from your knowledge base.

### 1. Structure Content by Topic

Organize your documents so that each one covers a single topic, feature, or policy.

**Do:**

* Create one document per topic, feature, or policy.
* Use clear section headers (e.g., Overview, Steps, Limitations, Examples).
* Keep each document tightly focused on one subject.

**Avoid:**

* Large "master" documents that cover many unrelated topics.
* Mixing multiple policies or product areas in a single file.

<Tip>
  **Rule of thumb:** If a question can be answered by a single section of a larger document, that section should ideally be its own document.
</Tip>

### 2. Keep Documents Focused and Moderate in Size

Very large documents make it harder for the system to find the right information quickly.

* Split long manuals into logical sections before uploading.
* Separate policies, feature guides, and FAQs into distinct files.
* Prefer multiple focused documents over one comprehensive PDF.

Structuring your content upfront avoids the need to go back and manually break apart large files later.
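
If your source material is already in Markdown, splitting it can be automated. The Python sketch below cuts a document into one section per heading at a chosen level so that each section can be uploaded as its own file. It is a simple illustration that assumes ATX-style (`#`) headings and does not handle headings inside fenced code blocks. The function name and the `Introduction` fallback title are hypothetical.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "## Refunds".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(text: str, level: int = 2) -> List[Tuple[str, str]]:
    """Split Markdown text into (title, body) sections at headings of `level`.

    Text before the first matching heading is kept under "Introduction".
    Sections with an empty body are dropped.
    """
    sections: List[Tuple[str, str]] = []
    title, lines = "Introduction", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) == level:
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections
```

Each `(title, body)` pair can then be saved as a separate `.txt` file, with the title reused as the `document_name`.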

### 3. Use High-Quality, Text-Based Sources

The knowledge base works best with content it can read as text.

**Best results:**

* Text-native PDFs (created digitally, not scanned)
* Structured web content
* Clearly formatted `.docx` or `.txt` files

**Lower reliability:**

* Scanned or image-based documents (text recognition can introduce errors)
* Dense tables with critical information embedded inside them

Whenever possible, provide the original text-based file rather than a scan or screenshot.

### 4. Be Explicit and Complete

The system can only retrieve information that is explicitly written in your documents. If something is not stated clearly, the persona may not be able to surface it.

Make sure your documents include:

* Definitions and terminology
* Constraints and prerequisites
* Exceptions and edge cases
* Common variations in phrasing (e.g., both acronyms and their full forms)

If something is business-critical, state it clearly and directly in your documents.

### 5. Avoid Conflicting or Duplicate Sources

When multiple documents say slightly different things about the same topic, the persona may return inconsistent answers.

* Maintain a single source of truth for each policy or topic.
* Archive outdated versions instead of keeping them alongside current ones.
* Avoid uploading drafts next to finalized documents.

### 6. Know When to Use Persona Instructions Instead

If certain content must appear in every response (such as required legal language or mandatory messaging), document retrieval alone may not guarantee its inclusion.

In these cases, incorporate that critical content directly into your [persona's instructions](https://docs.tavus.io/sections/conversational-video-interface/persona/overview) rather than relying solely on the knowledge base.

***

## Troubleshooting

If your persona's answers are inconsistent or incomplete, review the following:

* **Is the information buried in a very large document?** Try splitting it into smaller, focused files.
* **Are multiple documents providing conflicting guidance?** Consolidate to a single source of truth.
* **Is key information embedded in tables or images?** Convert it to structured text for better results.
* **Is the information clearly written in the document at all?** The system can only retrieve what is explicitly stated.
* **Should this content appear in every response?** If so, add it to your persona's instructions instead.

***

## Quick Setup Checklist

* One topic per document
* No large "all-in-one" manuals
* Text-based documents (avoid scans when possible)
* Clear headings and definitions
* No duplicate or conflicting sources