Create Document
POST /v2/documents
curl --request POST \
  --url https://tavusapi.com/v2/documents \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "document_url": "https://docs.example.com/",
  "document_name": "Example Docs",
  "callback_url": "https://your-server.com/webhook",
  "tags": [
    "docs",
    "website"
  ],
  "crawl": {
    "depth": 2,
    "max_pages": 10
  }
}
'
{
  "document_id": "d8-5c71baca86fc",
  "document_name": "Example Docs",
  "document_url": "https://docs.example.com/",
  "status": "started",
  "progress": null,
  "created_at": "2024-01-01T12:00:00Z",
  "updated_at": "2024-01-01T12:00:00Z",
  "callback_url": "https://your-server.com/webhook",
  "tags": [
    "docs",
    "website"
  ],
  "crawl_config": {
    "depth": 2,
    "max_pages": 10
  },
  "crawled_urls": [
    "https://docs.example.com/",
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api"
  ],
  "last_crawled_at": "2024-01-01T12:00:00Z",
  "crawl_count": 1
}
For now, our Knowledge Base only supports documents written in English and works best for conversations in English. We'll be expanding our Knowledge Base language support soon!
Create a new document in your Knowledge Base. When you hit this endpoint, Tavus kicks off processing of the document so it can be used as part of your Knowledge Base in conversations once processing is complete. The file size limit is 50MB, and processing can take up to a few minutes depending on file size.

Currently, we support the following file formats: .pdf, .txt, .docx, .doc, .png, .jpg, .pptx, .csv, and .xlsx. Website URLs are also supported; a website snapshot will be processed and transformed into a document.

You can manage documents by adding tags using the tags field in the request body. Once created, you can add the document to your personas (see Create Persona) and your conversations (see Create Conversation).
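For illustration, here is a minimal sketch of the same request from Python. It assumes the requests library and a placeholder API key; the fields mirror the body parameters documented below.

import requests

TAVUS_API_KEY = "your-api-key"  # placeholder; use your real key

# Create a document from a hosted file or website URL.
# Only document_url is required; the other fields are optional.
payload = {
    "document_url": "https://docs.example.com/",
    "document_name": "Example Docs",
    "callback_url": "https://your-server.com/webhook",
    "tags": ["docs", "website"],
}

response = requests.post(
    "https://tavusapi.com/v2/documents",
    headers={"Content-Type": "application/json", "x-api-key": TAVUS_API_KEY},
    json=payload,
)
response.raise_for_status()

document = response.json()
print(document["document_id"], document["status"])  # e.g. "d8-5c71baca86fc", "started"

Because the crawl object is omitted here, only the single page at document_url is scraped (see Website Crawling below).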

Website Crawling

When creating a document from a website URL, you can optionally enable multi-page crawling by providing the crawl parameter. This allows the system to follow links from your starting URL and process multiple pages into a single document.

Without Crawling (Default)

By default, only the single page at the provided URL is scraped and processed.

With Crawling

When you include the crawl object, the system will:
  1. Start at your provided URL
  2. Follow links to discover additional pages
  3. Process all discovered pages into a single document
Example request with crawling enabled:
{
  "document_name": "Company Knowledge Base",
  "document_url": "https://docs.example.com/",
  "crawl": {
    "depth": 2,
    "max_pages": 20
  },
  "callback_url": "https://your-server.com/webhook"
}

Crawl Parameters

  • depth (integer, 1-10): How many levels deep to follow links from the starting URL. A depth of 1 means only pages directly linked from the starting URL.
  • max_pages (integer, 1-100): Maximum number of pages to crawl. Processing stops once this limit is reached.

Rate Limits

To prevent abuse, crawling has the following limits:
  • Maximum 100 crawl documents per user
  • Maximum 5 concurrent crawls at any time
  • 1-hour cooldown between recrawls of the same document

Keeping Content Fresh

Once a document is created with crawl configuration, you can trigger a recrawl to fetch fresh content using the Recrawl Document endpoint.
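As a rough sketch, the snippet below checks the documented 1-hour cooldown client-side before requesting a recrawl. The recrawl endpoint path used here is an assumption for illustration only; see the Recrawl Document page for the actual route.

from datetime import datetime, timedelta, timezone

import requests

TAVUS_API_KEY = "your-api-key"  # placeholder

def recrawl_if_stale(document: dict) -> None:
    """Trigger a recrawl only if the 1-hour cooldown has elapsed.

    `document` is a document object as returned by this endpoint,
    including last_crawled_at (ISO 8601) and document_id.
    """
    last = document.get("last_crawled_at")
    if last is not None:
        last_crawled = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if datetime.now(timezone.utc) - last_crawled < timedelta(hours=1):
            print("Within the 1-hour cooldown; skipping recrawl.")
            return

    # Assumed path for illustration only; consult the Recrawl Document reference.
    url = f"https://tavusapi.com/v2/documents/{document['document_id']}/recrawl"
    resp = requests.post(url, headers={"x-api-key": TAVUS_API_KEY})
    resp.raise_for_status()
    print("Recrawl started:", resp.json())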

Authorizations

x-api-key
string
header
required

Body

application/json
document_url
string
required

The URL of the document or website to be processed

Example:

"https://docs.example.com/"

document_name
string

Optional name for the document. If not provided, a default name will be generated.

Example:

"Example Docs"

callback_url
string

Optional URL that will receive status updates about the document processing

Example:

"https://your-server.com/webhook"

tags
string[]

Optional array of tags to categorize the document

Example:

["docs", "website"]

crawl
object

Optional configuration for website crawling. When provided with a website URL, the system will follow links from the starting URL and process multiple pages. Without this parameter, only the single page at the URL is scraped.

Response

Document created successfully

document_id
string

Unique identifier for the created document

Example:

"d8-5c71baca86fc"

document_name
string

Name of the document

Example:

"Example Docs"

document_url
string

URL of the document or website

Example:

"https://docs.example.com/"

status
string

Current status of the document processing

Example:

"started"

progress
string | null

Progress indicator for document processing

Example:

null

created_at
string

ISO 8601 timestamp of when the document was created

Example:

"2024-01-01T12:00:00Z"

updated_at
string

ISO 8601 timestamp of when the document was last updated

Example:

"2024-01-01T12:00:00Z"

callback_url
string

URL that will receive status updates

Example:

"https://your-server.com/webhook"

tags
string[]

Array of document tags

Example:

["docs", "website"]

crawl_config
object

The crawl configuration used for this document (only present for crawled websites)

crawled_urls
string[] | null

List of URLs that were crawled (only present for crawled websites after processing completes)

Example:

[
  "https://docs.example.com/",
  "https://docs.example.com/getting-started",
  "https://docs.example.com/api"
]

last_crawled_at
string | null

ISO 8601 timestamp of when the document was last crawled

Example:

"2024-01-01T12:00:00Z"

crawl_count
integer | null

Number of times the document has been crawled

Example:

1