You can create a replica by uploading training video or training image assets. Use the Create Replica API with either train_video_url or train_image_url (mutually exclusive). You may also use the Developer Portal, which walks through the same choices.
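
As a minimal sketch of the mutually exclusive choice, the Python helper below builds a request body for Create Replica. The function name `build_replica_payload` and the example URLs are hypothetical; only the field names train_video_url, train_image_url, and voice_name come from this page. Rejecting voice_name on the video path is a choice made for this sketch, not confirmed API behavior. Sending the request (endpoint, authentication) is deliberately left out; check the Create Replica API reference for those details.

```python
from typing import Optional


def build_replica_payload(
    train_video_url: Optional[str] = None,
    train_image_url: Optional[str] = None,
    voice_name: Optional[str] = None,
) -> dict:
    """Build a Create Replica request body (hypothetical helper, not part of any SDK)."""
    # Exactly one training source may be set: the two URL fields are mutually exclusive.
    if (train_video_url is None) == (train_image_url is None):
        raise ValueError("Set exactly one of train_video_url or train_image_url")

    if train_video_url is not None:
        # Assumption of this sketch: voice_name is documented only for image
        # training, so it is rejected here rather than silently dropped.
        if voice_name is not None:
            raise ValueError("voice_name applies to image training only in this sketch")
        return {"train_video_url": train_video_url}

    # Image training needs a stock voice, because no voice is created from a photo.
    if not voice_name:
        raise ValueError("train_image_url requires voice_name")
    return {"train_image_url": train_image_url, "voice_name": voice_name}
```

For example, `build_replica_payload(train_image_url="https://example.com/headshot.png", voice_name="anna")` returns a body with both image fields, while passing both URL fields raises `ValueError` before any request is made.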

Compare video and image training

Each path has different time and quality trade-offs.

Record or upload video

Record or upload a training video for a replica with your personalized facial expressions. Use train_video_url in the API.
  • Captures your facial expressions
  • Highest fidelity output
  • Best for customer-facing use

Upload image

Upload a photo and select a voice. Quicker to set up, with a more generalized output. Use train_image_url and voice_name in the API.
  • Fastest to set up
  • No webcam required
  • Generalized facial expressions

Training from video

The following instructions have changed to work best for Phoenix-4 (the new default model). Here are the key differences:
  • Recording time reduced to 30 seconds of speaking + 30 seconds of still footage
  • Listening segment must be fully neutral with lips closed the entire time
  • Neck and jawline must be fully visible with clear clothing separation and hair kept off the face and neck
  • Teeth must be clearly visible during speaking with strong articulation
  • Framing must be stable, waist-up, seated, with minimal movement
Image to Replica: you can train from a headshot and a stock voice_name instead of video—see Training from image.

Talking Head Replica

To ensure the highest quality Phoenix-4 replica, your training video must follow the specifications outlined below.

Environment

  • Record in a quiet, well-lit space with no background noise or movement.
  • Use diffuse lighting to avoid shadows on your face.
  • Choose a simple background and avoid any moving people or objects.

Camera

  • Place the camera at eye level and ensure your face fills at least 25% of the frame.
  • Use a desktop recording app (e.g., QuickTime on Mac or Camera on Windows) — avoid browser-based tools.
  • Minimum resolution: 1080p. Anything lower may negatively impact replica quality.

Microphone

  • Use your device’s built-in microphone.
  • Avoid high-end mics or wireless earbuds like AirPods.
  • Turn off audio effects like noise suppression or EQ adjustments.

Framing & Distance

Your framing should resemble a natural Zoom-style call.
Positioning
  • Record from the waist up
  • Be seated at a desk or table
  • Position yourself at least 3 feet from the camera to avoid being too close to the lens
Camera Setup
  • Camera should be stable (no handheld movement)
  • Face centered in frame
  • Head, shoulders, and upper chest clearly visible

Yourself

✅ Do
  • Keep your full head visible, with a clear view of your face
  • Ensure your face and upper body are in sharp focus
  • If using a smartphone, follow the same framing and distance from the camera
  • Keep longer hair behind the shoulders, and tuck in any loose strands in front of the face
  • Sit upright in a stable, seated position
❌ Don’t
  • Wear clothes that blend into the background
  • Wear accessories like necklaces, hats, glasses, scarves, or earrings
  • Turn your head away from the camera
  • Block your chin or mouth with your microphone
  • Stand or shift positions during the video

Head & Clothing Separation

There must be a clear visual distinction between your head and clothing, and your neck must be fully visible.
  • No overlap between neck and clothing
  • Avoid high collars or obstructive clothing
  • Ensure the jawline and neck are fully visible

Hair Guidelines

  • Avoid complex hairstyles
  • No bangs covering the forehead
  • Tuck or pin loose strands
  • Longer hair must fall behind the shoulders
  • Hair should not obscure the face, neck, or shoulders

Video Format

If you’re uploading a pre-recorded training video via our API, ensure it meets the following requirements:
  • Minimum FPS: 25 fps
  • Accepted formats:
    • webm
    • mp4 with H.264 video codec and AAC audio codec
  • Maximum file size: 750MB
  • Minimum resolution: 1080p (lower may negatively impact replica quality)
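
The requirements above can be checked before upload with a small pre-flight function. The sketch below is an illustration only: the `VideoInfo` class and `check_training_video` function are hypothetical names, and the metadata is assumed to have been read already by a tool of your choice. It also assumes that "750MB" means decimal megabytes and that "1080p" refers to the shorter side of the frame; neither is stated on this page.

```python
from dataclasses import dataclass
from typing import List

MIN_FPS = 25
MAX_SIZE_BYTES = 750 * 1000 * 1000  # assumption: "750MB" means decimal megabytes
MIN_SHORT_SIDE = 1080  # assumption: "1080p" refers to the shorter side of the frame


@dataclass
class VideoInfo:
    """Metadata for a candidate training video (hypothetical container type)."""
    container: str    # e.g. "mp4" or "webm"
    video_codec: str  # e.g. "h264"
    audio_codec: str  # e.g. "aac"
    fps: float
    width: int
    height: int
    size_bytes: int


def check_training_video(info: VideoInfo) -> List[str]:
    """Return a list of issues; an empty list means every listed check passed."""
    issues = []
    container = info.container.lower()
    if container == "mp4":
        # mp4 is accepted only with H.264 video and AAC audio.
        if info.video_codec.lower() != "h264":
            issues.append("mp4 video codec must be H.264")
        if info.audio_codec.lower() != "aac":
            issues.append("mp4 audio codec must be AAC")
    elif container != "webm":
        issues.append("format must be webm or mp4")
    if info.fps < MIN_FPS:
        issues.append("frame rate is below 25 fps")
    if info.size_bytes > MAX_SIZE_BYTES:
        issues.append("file is larger than 750MB")
    if min(info.width, info.height) < MIN_SHORT_SIDE:
        issues.append("resolution is below 1080p and may reduce replica quality")
    return issues
```

Returning a list of issues rather than raising on the first failure lets a caller report every problem with a recording in one pass.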

Consent Statement

If you’re creating a personal replica, you must include a verbal consent statement in the video. This ensures ethical use and compliance with data protection laws. Say the following script clearly in your video:
I, (your name), am currently speaking and give consent to Tavus to create an AI clone of me by using the audio and video samples I provide. I understand that this AI clone can be used to create videos that look and sound like me.
Consent is only required for personal replicas. If you’re creating an AI replica or using AI-generated training video, you can skip this.

Recording Structure

Your video must be one continuous shot, containing 30 seconds of speaking followed by 30 seconds of still footage. You can use a script provided by Tavus or speak on any topic of your choice.
Pro tips:
  • Keep body and head movements subtle
  • Avoid heavy hand gestures
  • Only one person should appear in the video

1. Opening

  • Begin with a big smile showing upper and lower teeth
  • Maintain direct eye contact with the camera for approximately 1 second

2. Speaking Segment (30 Seconds)

  • Speak on any topic — content does not matter
  • Open your mouth clearly when speaking
  • Enunciate well, ensuring all teeth are fully visible
  • Keep visible space between your top and bottom teeth
  • Keep head and body movement minimal
  • Avoid hand gestures
  • Avoid sudden head turns
Sample script (optional):
Once upon a time, people built a perfect park in the middle of a busy city. This park was big, bright, and full of playful paths. At sunrise, birds sang above the tall trees. Families carried baskets packed with bread, fruit, and juice.

Children skipped and shouted, chasing balls and flying paper kites. In the afternoon, people played games. Some tapped paddles and bounced plastic balls. Others kicked soccer balls back and forth, laughing loudly with every point scored.

3. Still Segment (30 Seconds)

  • Keep your head completely still
  • Keep lips neutral and closed throughout
  • Maintain direct eye contact with the camera
  • Do not lick lips or form unusual mouth shapes
  • Avoid any head tilting or movement
Replica training typically takes 4–5 hours. You can track the training progress by:

High-Quality Training Example

Full Body Replica

To create a full body replica for conversational video, follow these guidelines:

Framing & Orientation

  • The subject must be captured from head to toe, with no extra space above or below.
  • Record in vertical format (portrait mode) or crop appropriately to maintain vertical framing.

Posture & Movement

  • Remain standing still throughout the recording.
  • Avoid hand gestures or exaggerated body movements to maintain consistency and model quality.

Resolution & Quality

  • A 4K resolution is recommended for best results.
  • Ensure consistent lighting, with no shadows or sudden changes in exposure.

Training from image

Use this path when you call Create Replica with train_image_url and voice_name. The image file must be reachable at a publicly accessible URL (for example a presigned S3 GET URL), same as for video uploads.

File format and resolution

  • Formats: JPG or PNG
  • Minimum resolution: 512×512 pixels (larger is fine if aspect and quality are good)
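
A quick local check of these two requirements might look like the sketch below. The function name `check_training_image` is hypothetical, the format is inferred from the file extension only, and the width and height are assumed to have been read already with an image library of your choice.

```python
import os
from typing import List

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MIN_SIDE = 512  # minimum width and height in pixels


def check_training_image(filename: str, width: int, height: int) -> List[str]:
    """Return a list of issues; an empty list means both checks passed."""
    issues = []
    # The format is judged by extension only; the file contents are not inspected.
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        issues.append("format must be JPG or PNG")
    if width < MIN_SIDE or height < MIN_SIDE:
        issues.append("resolution must be at least 512x512 pixels")
    return issues
```

The composition guidelines (single subject, no accessories, even lighting) are not covered by this check and still need a visual review.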

Image composition and quality

Upload a clear, front-facing headshot. For best results, follow these guidelines:
Requirements
  • Composition: Front-facing; head and shoulders visible in frame
  • Subject: Only one person in the photo
  • Accessories: No glasses, hats, or face-covering accessories
  • Jewelry: No prominent earrings or visible necklaces
  • Hair: Hair behind the shoulders or tied up so the face and neck are clear
  • Lighting: Even lighting; avoid heavy shadows across the face

How voice_name works

Image-based training does not create a new voice from your source material. Instead, you must set voice_name to a stock voice identifier slug (for example anna). This selects a voice tied to an existing Tavus stock replica so the trained replica has a usable default voice.
When you run Conversational Video Interface (CVI) sessions later, you are not locked into that stock voice for every conversation. You can attach a persona whose TTS layer uses an external voice (from Cartesia or ElevenLabs). See Text-to-Speech (TTS) for how to set external_voice_id and related fields.
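
To make the persona override concrete, the sketch below builds a persona fragment that points the TTS layer at an external voice. Treat it as an assumption-heavy illustration: only external_voice_id is named on this page, while the `layers`, `tts`, and `tts_engine` keys, the engine identifiers, and the function name `build_tts_layer` are assumptions that should be checked against the Text-to-Speech (TTS) reference before use.

```python
# Assumed engine identifiers for the two providers mentioned on this page.
SUPPORTED_ENGINES = {"cartesia", "elevenlabs"}


def build_tts_layer(tts_engine: str, external_voice_id: str) -> dict:
    """Build a persona fragment with an external voice (hypothetical helper)."""
    engine = tts_engine.lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError("tts_engine must be one of: " + ", ".join(sorted(SUPPORTED_ENGINES)))
    if not external_voice_id:
        raise ValueError("external_voice_id is required")
    # Assumed nesting: the key names other than external_voice_id are not
    # confirmed by this page.
    return {
        "layers": {
            "tts": {
                "tts_engine": engine,
                "external_voice_id": external_voice_id,
            }
        }
    }
```

The stock voice chosen through voice_name stays on the replica; a fragment like this would apply only to conversations that use the persona.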

Example voice_name values

Below are example voice_name slugs:
benjamin
james
liam
anna
julia
ivy
By using the image training API, you affirm that you have the rights to use the image you supply (for example likeness and publicity rights where applicable). Tavus may reject images that appear to depict unauthorized or impermissible subjects. This is separate from the verbal consent requirement for personal replicas trained from video; see Consent Statement in the video section when using train_video_url.