POST /v2/documents/{document_id}/recrawl
Recrawl Document

Example request:

curl --request POST \
  --url https://tavusapi.com/v2/documents/{document_id}/recrawl \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "crawl": {
    "depth": 2,
    "max_pages": 10
  }
}
'

Example response:

{
  "document_id": "d8-5c71baca86fc",
  "document_name": "Company Website",
  "document_url": "https://example.com/",
  "status": "recrawling",
  "progress": null,
  "created_at": "2024-01-01T12:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "callback_url": "https://your-server.com/webhook",
  "tags": [
    "website",
    "company"
  ],
  "properties": {},
  "crawl_config": {
    "depth": 2,
    "max_pages": 10
  },
  "crawled_urls": [
    "https://docs.example.com/",
    "https://docs.example.com/getting-started"
  ],
  "last_crawled_at": "2024-01-01T12:05:00Z",
  "crawl_count": 1
}
Trigger a recrawl of a document that was created with a crawl configuration. This is useful for keeping your knowledge base up to date when website content changes.

When to Recrawl

Use this endpoint when:
  • The source website has been updated with new content
  • You want to refresh the document’s content on a schedule (see the scheduling sketch after this list)
  • The initial crawl encountered errors and you want to retry
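
For scheduled refreshes, the call is easy to script. Below is a minimal sketch in Python using the requests library; the API key and document IDs are placeholders, and the schedule must respect the 1-hour cooldown described under Rate Limits below.

import requests

API_KEY = "your-api-key"            # placeholder
DOCUMENT_IDS = ["d8-5c71baca86fc"]  # documents you want to keep fresh

def recrawl(document_id):
    # POSTing without a body reuses the crawl configuration stored at creation
    resp = requests.post(
        f"https://tavusapi.com/v2/documents/{document_id}/recrawl",
        headers={"x-api-key": API_KEY},
    )
    resp.raise_for_status()
    print(document_id, resp.json().get("status"))  # expect "recrawling"

# Run this from a scheduler (cron, etc.) no more than once per hour per document.
for doc_id in DOCUMENT_IDS:
    recrawl(doc_id)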

How Recrawling Works

When you trigger a recrawl:
  1. The system uses the same starting URL from the original document
  2. Links are followed according to the crawl configuration (depth and max_pages), as pictured in the sketch after these steps
  3. New content is processed and stored
  4. Old vectors are replaced with the new content once processing completes
  5. The document’s crawl_count is incremented and last_crawled_at is updated
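
The crawler itself runs server-side and its implementation isn't documented here, but the effect of depth and max_pages can be pictured as a bounded breadth-first traversal. A purely illustrative Python sketch of the semantics (not Tavus's actual crawler):

from collections import deque

def crawl_order(start_url, links_from, depth=2, max_pages=10):
    # links_from(url) stands in for fetching a page and extracting its links.
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, d = queue.popleft()
        visited.append(url)        # this page is fetched, processed, and stored
        if d < depth:              # follow links only while under the depth limit
            for link in links_from(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, d + 1))
    return visited

With depth=2 and max_pages=10, pages more than two links away from the starting URL are never visited, and crawling stops early once ten pages have been processed.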

Requirements

  • Document State: The document must be in ready or error state (see the pre-check sketch after this list)
  • Crawl Configuration: The document must have been created with a crawl configuration, or you must provide one in the request body
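
Because of these requirements, it can help to check the document before attempting a recrawl. A sketch assuming the Get Document endpoint lives at GET /v2/documents/{document_id} (the exact path isn't shown on this page) and returns the fields documented in the Response section below:

import requests

API_KEY = "your-api-key"  # placeholder

def can_recrawl(document_id):
    resp = requests.get(
        f"https://tavusapi.com/v2/documents/{document_id}",  # assumed Get Document path
        headers={"x-api-key": API_KEY},
    )
    resp.raise_for_status()
    doc = resp.json()
    # Recrawl needs ready/error state, plus a stored crawl configuration
    # (unless you pass one in the recrawl request body).
    return doc.get("status") in ("ready", "error") and doc.get("crawl_config") is not None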

Rate Limits

To prevent abuse, the following limits apply:
  • Cooldown Period: 1 hour between recrawls of the same document (see the retry sketch after this list)
  • Concurrent Crawls: Maximum 5 crawls running simultaneously per user
  • Total Documents: Maximum 100 crawl documents per user
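
When a limit is hit, the request is rejected. This page doesn't state the status code used, but assuming the conventional HTTP 429, a retry wrapper might look like this sketch:

import time
import requests

API_KEY = "your-api-key"  # placeholder

def recrawl_with_retry(document_id, max_attempts=3):
    url = f"https://tavusapi.com/v2/documents/{document_id}/recrawl"
    for attempt in range(max_attempts):
        resp = requests.post(url, headers={"x-api-key": API_KEY})
        if resp.status_code != 429:  # assumed code for cooldown/concurrency rejections
            resp.raise_for_status()
            return resp.json()
        # Brief backoff only helps against the concurrent-crawl limit;
        # the per-document cooldown is a full hour, so don't spin on it.
        time.sleep(60 * (attempt + 1))
    raise RuntimeError(f"recrawl of {document_id} still rate-limited after {max_attempts} attempts")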

Overriding Crawl Configuration

You can optionally provide a crawl object in the request body to override the stored configuration for this recrawl:
{
  "crawl": {
    "depth": 3,
    "max_pages": 50
  }
}
If no crawl object is provided, the original crawl configuration from document creation is used.
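
For reference, the same override as the curl example at the top of this page, sketched in Python:

import requests

API_KEY = "your-api-key"  # placeholder

resp = requests.post(
    "https://tavusapi.com/v2/documents/d8-5c71baca86fc/recrawl",
    headers={"x-api-key": API_KEY},
    json={"crawl": {"depth": 3, "max_pages": 50}},  # applies to this recrawl only
)
resp.raise_for_status()
print(resp.json()["crawl_config"])  # the configuration used for this recrawl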

Monitoring Recrawl Progress

After initiating a recrawl:
  1. The document status changes to recrawling
  2. If you provided a callback_url during document creation, you’ll receive status updates
  3. When complete, the status changes to ready (or error if it failed)
  4. Use Get Document to check the current status (a polling sketch follows)
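
If you didn't set a callback_url, polling is the fallback. A minimal sketch, again assuming Get Document is served at GET /v2/documents/{document_id}:

import time
import requests

API_KEY = "your-api-key"  # placeholder

def wait_for_recrawl(document_id, poll_seconds=30, timeout=1800):
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            f"https://tavusapi.com/v2/documents/{document_id}",  # assumed path
            headers={"x-api-key": API_KEY},
        )
        resp.raise_for_status()
        status = resp.json()["status"]
        if status in ("ready", "error"):  # terminal states for a recrawl
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"document {document_id} still recrawling after {timeout}s")

Webhook delivery via callback_url avoids polling entirely and is preferable when you control a server endpoint.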

Authorizations

x-api-key (string, header, required)

Path Parameters

document_id (string, required)

The unique identifier of the document to recrawl

Body

application/json
crawl (object)

Optional crawl configuration to override the stored settings. If not provided, the original crawl configuration will be used.

Response

Recrawl initiated successfully

document_id (string)

Unique identifier for the document

Example:

"d8-5c71baca86fc"

document_name (string)

Name of the document

Example:

"Company Website"

document_url (string)

URL of the document

Example:

"https://example.com/"

status (string)

Current status of the document (will be 'recrawling')

Example:

"recrawling"

progress (integer | null)

Progress indicator for document processing

Example:

null

created_at (string)

ISO 8601 timestamp of when the document was created

Example:

"2024-01-01T12:00:00Z"

updated_at (string)

ISO 8601 timestamp of when the document was last updated

Example:

"2024-01-15T10:30:00Z"

callback_url (string)

URL that will receive status updates

Example:

"https://your-server.com/webhook"

tags (string[])

Array of document tags

Example:

["website", "company"]

properties (object)

Additional document properties

Example:

{}

crawl_config (object)

The crawl configuration being used for the recrawl

crawled_urls (string[] | null)

List of URLs from the previous crawl (will be updated when recrawl completes)

Example:

[
  "https://docs.example.com/",
  "https://docs.example.com/getting-started"
]

last_crawled_at (string | null)

ISO 8601 timestamp of the previous crawl

Example:

"2024-01-01T12:05:00Z"

crawl_count (integer)

Number of times the document has been crawled (will increment when recrawl completes)

Example:

1