POST /v2/documents/{document_id}/recrawl
Recrawl Document

Example request:

curl --request POST \
  --url https://tavusapi.com/v2/documents/{document_id}/recrawl \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "crawl": {
    "depth": 2,
    "max_pages": 10
  }
}
'

Example response:

{
  "document_id": "d8-5c71baca86fc",
  "document_name": "Company Website",
  "document_url": "https://example.com/",
  "status": "recrawling",
  "progress": null,
  "created_at": "2024-01-01T12:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "callback_url": "https://your-server.com/webhook",
  "tags": [
    "website",
    "company"
  ],
  "properties": {},
  "crawl_config": {
    "depth": 2,
    "max_pages": 10
  },
  "crawled_urls": [
    "https://docs.example.com/",
    "https://docs.example.com/getting-started"
  ],
  "last_crawled_at": "2024-01-01T12:05:00Z",
  "crawl_count": 1
}
Trigger a recrawl of a document that was created with a crawl configuration. This is useful for keeping your knowledge base up to date when website content changes.

When to Recrawl

Use this endpoint when:
  • The source website has been updated with new content
  • You want to refresh the document’s content on a schedule (see the scheduling sketch after this list)
  • The initial crawl encountered errors and you want to retry
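
For scheduled refreshes, the call is easy to script. Below is a minimal sketch in Python using the requests library; the API key and document IDs are placeholders, and the schedule must respect the 1-hour cooldown described under Rate Limits below.

import requests

API_KEY = "your-api-key"            # placeholder
DOCUMENT_IDS = ["d8-5c71baca86fc"]  # documents you want to keep fresh

def recrawl(document_id):
    # POSTing without a body reuses the crawl configuration stored at creation
    resp = requests.post(
        f"https://tavusapi.com/v2/documents/{document_id}/recrawl",
        headers={"x-api-key": API_KEY},
    )
    resp.raise_for_status()
    print(document_id, resp.json().get("status"))  # expect "recrawling"

# Run this from a scheduler (cron, etc.) no more than once per hour per document.
for doc_id in DOCUMENT_IDS:
    recrawl(doc_id)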

How Recrawling Works

When you trigger a recrawl:
  1. The system uses the same starting URL from the original document
  2. Links are followed according to the crawl configuration (depth and max_pages), as pictured in the sketch after these steps
  3. New content is processed and stored
  4. Old vectors are replaced with the new content once processing completes
  5. The document’s crawl_count is incremented and last_crawled_at is updated
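
The crawler itself runs server-side and its implementation isn't documented here, but the effect of depth and max_pages can be pictured as a bounded breadth-first traversal. A purely illustrative Python sketch of the semantics (not Tavus's actual crawler):

from collections import deque

def crawl_order(start_url, links_from, depth=2, max_pages=10):
    # links_from(url) stands in for fetching a page and extracting its links.
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, d = queue.popleft()
        visited.append(url)        # this page is fetched, processed, and stored
        if d < depth:              # follow links only while under the depth limit
            for link in links_from(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, d + 1))
    return visited

With depth=2 and max_pages=10, pages more than two links away from the starting URL are never visited, and crawling stops early once ten pages have been processed.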

Requirements

  • Document State: The document must be in ready or error state (see the pre-check sketch after this list)
  • Crawl Configuration: The document must have been created with a crawl configuration, or you must provide one in the request body
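
Because of these requirements, it can help to check the document before attempting a recrawl. A sketch assuming the Get Document endpoint lives at GET /v2/documents/{document_id} (the exact path isn't shown on this page) and returns the fields documented in the Response section below:

import requests

API_KEY = "your-api-key"  # placeholder

def can_recrawl(document_id):
    resp = requests.get(
        f"https://tavusapi.com/v2/documents/{document_id}",  # assumed Get Document path
        headers={"x-api-key": API_KEY},
    )
    resp.raise_for_status()
    doc = resp.json()
    # Recrawl needs ready/error state, plus a stored crawl configuration
    # (unless you pass one in the recrawl request body).
    return doc.get("status") in ("ready", "error") and doc.get("crawl_config") is not None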

Rate Limits

To prevent abuse, the following limits apply:
  • Cooldown Period: 1 hour between recrawls of the same document (see the retry sketch after this list)
  • Concurrent Crawls: Maximum 5 crawls running simultaneously per user
  • Total Documents: Maximum 100 crawl documents per user
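
When a limit is hit, the request is rejected. This page doesn't state the status code used, but assuming the conventional HTTP 429, a retry wrapper might look like this sketch:

import time
import requests

API_KEY = "your-api-key"  # placeholder

def recrawl_with_retry(document_id, max_attempts=3):
    url = f"https://tavusapi.com/v2/documents/{document_id}/recrawl"
    for attempt in range(max_attempts):
        resp = requests.post(url, headers={"x-api-key": API_KEY})
        if resp.status_code != 429:  # assumed code for cooldown/concurrency rejections
            resp.raise_for_status()
            return resp.json()
        # Brief backoff only helps against the concurrent-crawl limit;
        # the per-document cooldown is a full hour, so don't spin on it.
        time.sleep(60 * (attempt + 1))
    raise RuntimeError(f"recrawl of {document_id} still rate-limited after {max_attempts} attempts")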

Overriding Crawl Configuration

You can optionally provide a crawl object in the request body to override the stored configuration for this recrawl:
{
  "crawl": {
    "depth": 3,
    "max_pages": 50
  }
}
If no crawl object is provided, the original crawl configuration from document creation is used.
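
For reference, the same override as the curl example at the top of this page, sketched in Python:

import requests

API_KEY = "your-api-key"  # placeholder

resp = requests.post(
    "https://tavusapi.com/v2/documents/d8-5c71baca86fc/recrawl",
    headers={"x-api-key": API_KEY},
    json={"crawl": {"depth": 3, "max_pages": 50}},  # applies to this recrawl only
)
resp.raise_for_status()
print(resp.json()["crawl_config"])  # the configuration used for this recrawl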

Monitoring Recrawl Progress

After initiating a recrawl:
  1. The document status changes to recrawling
  2. If you provided a callback_url during document creation, you’ll receive status updates
  3. When complete, the status changes to ready (or error if it failed)
  4. Use Get Document to check the current status (a polling sketch follows)
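
If you didn't set a callback_url, polling is the fallback. A minimal sketch, again assuming Get Document is served at GET /v2/documents/{document_id}:

import time
import requests

API_KEY = "your-api-key"  # placeholder

def wait_for_recrawl(document_id, poll_seconds=30, timeout=1800):
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            f"https://tavusapi.com/v2/documents/{document_id}",  # assumed path
            headers={"x-api-key": API_KEY},
        )
        resp.raise_for_status()
        status = resp.json()["status"]
        if status in ("ready", "error"):  # terminal states for a recrawl
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"document {document_id} still recrawling after {timeout}s")

Webhook delivery via callback_url avoids polling entirely and is preferable when you control a server endpoint.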

Authorizations

x-api-key (string, header, required)

Path Parameters

document_id (string, required)

The unique identifier of the document to recrawl

Body

application/json
crawl (object)

Optional crawl configuration to override the stored settings. If not provided, the original crawl configuration will be used.

Response

Recrawl initiated successfully

document_id (string)

Unique identifier for the document

Example:

"d8-5c71baca86fc"

document_name (string)

Name of the document

Example:

"Company Website"

document_url (string)

URL of the document

Example:

"https://example.com/"

status (string)

Current status of the document (will be 'recrawling')

Example:

"recrawling"

progress (integer | null)

Progress indicator for document processing

Example:

null

created_at (string)

ISO 8601 timestamp of when the document was created

Example:

"2024-01-01T12:00:00Z"

updated_at (string)

ISO 8601 timestamp of when the document was last updated

Example:

"2024-01-15T10:30:00Z"

callback_url (string)

URL that will receive status updates

Example:

"https://your-server.com/webhook"

tags (string[])

Array of document tags

Example:

["website", "company"]

properties (object)

Additional document properties

Example:

{}

crawl_config (object)

The crawl configuration being used for the recrawl

crawled_urls (string[] | null)

List of URLs from the previous crawl (will be updated when recrawl completes)

Example:

[
  "https://docs.example.com/",
  "https://docs.example.com/getting-started"
]

last_crawled_at (string | null)

ISO 8601 timestamp of the previous crawl

Example:

"2024-01-01T12:05:00Z"

crawl_count (integer)

Number of times the document has been crawled (will increment when recrawl completes)

Example:

1