train_video_url or train_image_url (mutually exclusive). You may also use the Developer Portal, which walks through the same choices.
Compare video and image training
Each path has different time and quality trade-offs.
Record or upload video
Record or upload a training video for a replica with your personalized facial expressions.
Uses train_video_url in the API.
- Captures your facial expressions
- Highest fidelity output
- Best for customer-facing use
Upload image
Upload a photo and select a voice. Quicker to set up with a more generalized output.
Uses train_image_url and voice_name in the API.
- Fastest to set up
- No webcam required
- Generalized facial expressions
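As a rough illustration of the two paths above, the following Python sketch builds a request body for Create Replica and rejects combinations this page rules out. The helper name build_replica_payload is hypothetical and not part of any SDK. Only the field names train_video_url, train_image_url, and voice_name come from this page, and the real request may accept or require other fields.

```python
from typing import Optional


def build_replica_payload(
    train_video_url: Optional[str] = None,
    train_image_url: Optional[str] = None,
    voice_name: Optional[str] = None,
) -> dict:
    """Build a Create Replica request body (hypothetical helper)."""
    # The two training sources are mutually exclusive: exactly one must be set.
    if bool(train_video_url) == bool(train_image_url):
        raise ValueError("Set exactly one of train_video_url or train_image_url")

    if train_video_url:
        # Assumption of this sketch: voice_name is left out on the video path.
        return {"train_video_url": train_video_url}

    # Image training needs a stock voice, so voice_name is required here.
    if not voice_name:
        raise ValueError("voice_name is required with train_image_url")
    return {"train_image_url": train_image_url, "voice_name": voice_name}
```

For example, build_replica_payload(train_image_url="https://example.com/headshot.png", voice_name="anna") returns a body with both image fields, while passing both URLs raises ValueError.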
Training from video
The following instructions have changed to work best for Phoenix-4 (the new default model).
Here are the key differences:
- Recording time reduced to 30 seconds of speaking + 30 seconds of still footage
- Listening segment must be fully neutral with lips closed the entire time
- Neck and jawline must be fully visible with clear clothing separation and hair kept off the face and neck
- Teeth must be clearly visible during speaking with strong articulation
- Framing must be stable, waist-up, seated, with minimal movement
To train from train_image_url and voice_name instead of video, see Training from image.
Talking Head Replica
To ensure the highest quality Phoenix-4 replica, your training video must follow the specifications outlined below.
Environment
- Record in a quiet, well-lit space with no background noise or movement.
- Use diffuse lighting to avoid shadows on your face.
- Choose a simple background and avoid any moving people or objects.
Camera
- Place the camera at eye level and ensure your face fills at least 25% of the frame.
- Use a desktop recording app (e.g., QuickTime on Mac or Camera on Windows) — avoid browser-based tools.
- Minimum resolution: 1080p. Anything lower may negatively impact replica quality.
Microphone
- Use your device’s built-in microphone.
- Avoid high-end mics or wireless earbuds like AirPods.
- Turn off audio effects like noise suppression or EQ adjustments.
Framing & Distance
Your framing should resemble a natural Zoom-style call.
Positioning
- Record from the waist up
- Be seated at a desk or table
- Position yourself at least 3 feet from the camera to avoid being too close to the lens
- Camera should be stable (no handheld movement)
- Face centered in frame
- Head, shoulders, and upper chest clearly visible
Yourself

| ✅ Do | ❌ Don’t |
|---|---|
| Keep your full head visible, with a clear view of your face | Wear clothes that blend into the background |
| Ensure your face and upper body are in sharp focus | Wear accessories like necklaces, hats, glasses, scarves, or earrings |
| If using a smartphone, follow the same framing and distance guidelines | Turn your head away from the camera |
| Keep longer hair behind shoulders, and tuck in any loose strands in front of the face | Block your chin or mouth with your microphone |
| Sit upright in a stable, seated position | Stand or shift positions during the video |
Head & Clothing Separation
There must be a clear visual distinction between your head and clothing, and your neck must be fully visible.
- No overlap between neck and clothing
- Avoid high collars or obstructive clothing
- Ensure the jawline and neck are fully visible
Hair Guidelines
- Avoid complex hairstyles
- No bangs covering the forehead
- Tuck or pin loose strands
- Longer hair must fall behind the shoulders
- Hair should not obscure the face, neck, or shoulders
Video Format
If you’re uploading a pre-recorded training video via our API, ensure it meets the following requirements:
- Minimum FPS: 25 fps
- Accepted formats: webm, mp4 with H.264 video codec and AAC audio codec
- Maximum file size: 750MB
- Minimum resolution: 1080p (lower may negatively impact replica quality)
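The requirements above can be checked before upload. The sketch below is one minimal way to do that in Python, assuming you have already read the file's metadata with a tool of your choice (for example ffprobe). VideoInfo and check_training_video are hypothetical names. The sketch also makes three assumptions the page does not confirm: the H.264 and AAC requirement applies to mp4 only, "1080p" means a frame height of at least 1080 pixels, and 750MB is counted in decimal megabytes.

```python
from dataclasses import dataclass
from typing import List

MIN_FPS = 25
MIN_HEIGHT = 1080                # assumption: "1080p" means height >= 1080 px
MAX_BYTES = 750 * 1000 * 1000    # assumption: 750MB in decimal megabytes
ACCEPTED_CONTAINERS = {"webm", "mp4"}


@dataclass
class VideoInfo:
    """Metadata the caller extracts beforehand (hypothetical structure)."""
    container: str     # e.g. "mp4" or "webm"
    video_codec: str   # e.g. "h264"
    audio_codec: str   # e.g. "aac"
    fps: float
    height: int
    size_bytes: int


def check_training_video(info: VideoInfo) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    container = info.container.lower()
    if container not in ACCEPTED_CONTAINERS:
        problems.append(f"unsupported container: {info.container}")
    if container == "mp4":
        # Assumption: the codec requirement applies to mp4 files only.
        if info.video_codec.lower() != "h264":
            problems.append("mp4 video codec should be H.264")
        if info.audio_codec.lower() != "aac":
            problems.append("mp4 audio codec should be AAC")
    if info.fps < MIN_FPS:
        problems.append(f"frame rate {info.fps} is below {MIN_FPS} fps")
    if info.height < MIN_HEIGHT:
        problems.append("resolution below 1080p may reduce replica quality")
    if info.size_bytes > MAX_BYTES:
        problems.append("file is larger than 750MB")
    return problems
```

Returning a list instead of raising on the first failure lets you report every issue with a recording at once.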
Consent Statement
If you’re creating a personal replica, you must include a verbal consent statement in the video. This ensures ethical use and compliance with data protection laws. Consent is not required for AI-generated training videos. Say the following script clearly in your video:
I, (your name), am currently speaking and give consent to Tavus to create an AI clone of me by using the audio and video samples I provide. I understand that this AI clone can be used to create videos that look and sound like me.
Consent is only required for personal replicas. If you’re creating an AI replica or using AI-generated training video, you can skip this.
Recording Structure
Your video must be one continuous shot, containing 30 seconds of speaking followed by 30 seconds of still footage. You can use a script provided by Tavus or speak on any topic of your choice.
Opening
- Begin with a big smile showing upper and lower teeth
- Maintain direct eye contact with the camera for approximately 1 second
Speaking Segment (30 Seconds)
- Speak on any topic — content does not matter
- Open your mouth clearly when speaking
- Enunciate well, ensuring all teeth are fully visible
- Keep visible space between your top and bottom teeth
- Keep head and body movement minimal
- Avoid hand gestures
- Avoid sudden head turns

Replica training typically takes 4–5 hours. You can track the training progress by:
- Providing a callback_url when creating the replica via API
- Using the Get Replica Status API
- Checking the Developer Portal
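If you poll instead of using a callback, a loop along the following lines may be enough. This is a sketch under stated assumptions: wait_for_replica is a hypothetical helper, fetch_status is a function you supply that calls the Get Replica Status API and returns the status string, and the terminal status values "completed" and "error" are assumed here and should be checked against the API reference.

```python
import time
from typing import Callable

# Assumption: these are the statuses that mean training has finished.
TERMINAL_STATUSES = {"completed", "error"}


def wait_for_replica(
    fetch_status: Callable[[], str],
    poll_seconds: float = 300.0,
    timeout_seconds: float = 8 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the replica reaches a terminal status or the timeout passes.

    fetch_status is caller-supplied and should wrap the Get Replica Status API.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"replica still in status {status!r}")
        # Training takes hours, so a long polling interval is reasonable.
        sleep(poll_seconds)
```

The default timeout of eight hours leaves headroom over the typical 4–5 hour training time. The sleep and clock parameters exist only so the loop can be exercised without waiting.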
High-Quality Training Example
Full Body Replica
To create a full body replica for conversational video, follow these guidelines:
Framing & Orientation
- The subject must be captured from head to toe, with no extra space above or below.
- Record in vertical format (portrait mode) or crop appropriately to maintain vertical framing.
Posture & Movement
- Remain standing still throughout the recording.
- Avoid hand gestures or exaggerated body movements to maintain consistency and model quality.
Resolution & Quality
- A 4K resolution is recommended for best results.
- Ensure consistent lighting, with no shadows or sudden changes in exposure.
Training from image
Use this path when you call Create Replica with train_image_url and voice_name. The image file must be reachable at a publicly accessible URL (for example, a presigned S3 GET URL), same as for video uploads.
File format and resolution
- Formats: JPG or PNG
- Minimum resolution: 512×512 pixels (larger is fine if the aspect ratio and quality are good)
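A quick pre-upload check for these two requirements might look like the Python sketch below. check_training_image is a hypothetical helper, and it assumes you already know the image's pixel dimensions. Accepting the .jpeg extension alongside .jpg is also an assumption, since the page lists only JPG and PNG.

```python
import os
from typing import List

# Assumption: ".jpeg" is treated the same as ".jpg".
ACCEPTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MIN_SIDE = 512  # minimum width and height in pixels


def check_training_image(filename: str, width: int, height: int) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if width < MIN_SIDE or height < MIN_SIDE:
        problems.append(f"{width}x{height} is below the 512x512 minimum")
    return problems
```

This checks only the file name and size. The composition guidelines (single subject, even lighting, no accessories) still need a visual review.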
Image composition and quality
Upload a clear, front-facing headshot. For best results, follow these guidelines:
Requirements
- Composition: Front-facing; head and shoulders visible in frame
- Subject: Only one person in the photo
- Accessories: No glasses, hats, or face-covering accessories
- Jewelry: No prominent earrings or visible necklaces
- Hair: Hair behind the shoulders or tied up so the face and neck are clear
- Lighting: Even lighting; avoid heavy shadows across the face
How voice_name works
Image-based training does not create a new voice from your source material. Instead, you must set voice_name to a stock voice identifier slug (for example anna). This selects a voice tied to an existing Tavus stock replica so the trained replica has a usable default voice.
When you run Conversational Video Interface (CVI) sessions later, you are not locked into that stock voice for every conversation. You can attach a persona whose TTS layer uses an external voice (from Cartesia or ElevenLabs). See Text-to-Speech (TTS) for how to set external_voice_id and related fields.
Example voice_name values
Below are example voice_name slugs with a short sample clip for each.
Consent, rights, and acceptable use
By using the image training API, you affirm that you have the rights to use the image you supply (for example, likeness and publicity rights where applicable). Tavus may reject images that appear to depict unauthorized or impermissible subjects. This is separate from the verbal consent requirement for personal replicas trained from video; see Consent Statement in the video section when using train_video_url.

