The following instructions have changed to work best for Phoenix-4 (the new default model). Here are the key differences:
- Listening minute must be fully neutral with lips closed the entire time
- Neck and jawline must be fully visible with clear clothing separation and hair kept off the face and neck
- Teeth must be clearly visible during speaking with strong articulation
- Framing must be stable, waist-up, seated, with minimal movement
To use the previous model instead, set model_name to phoenix-3.
Talking Head Replica
To ensure the highest quality Phoenix-4 replica, your training video must follow the specifications outlined below.
Environment
- Record in a quiet, well-lit space with no background noise or movement.
- Use diffuse lighting to avoid shadows on your face.
- Choose a simple background and avoid any moving people or objects.
Camera
- Place the camera at eye level and ensure your face fills at least 25% of the frame.
- Use a desktop recording app (e.g., QuickTime on Mac or Camera on Windows) — avoid browser-based tools.
- Minimum resolution: 1080p. Anything lower may negatively impact replica quality.
Microphone
- Use your device’s built-in microphone.
- Avoid high-end mics or wireless earbuds like AirPods.
- Turn off audio effects like noise suppression or EQ adjustments.
Framing & Distance
Your framing should resemble a natural Zoom-style call.
Positioning
- Record from the waist up
- Be seated at a desk or table
- Position yourself at least 3 feet from the camera to avoid being too close to the lens
- Camera should be stable (no handheld movement)
- Face centered in frame
- Head, shoulders, and upper chest clearly visible
Yourself

| ✅ Do | ❌ Don’t |
|---|---|
| Keep your full head visible, with a clear view of your face | Wear clothes that blend into the background |
| Ensure your face and upper body are in sharp focus | Wear accessories like necklaces, hats, glasses, scarves, or earrings |
| If using a smartphone, keep the same framing and distance from the camera | Turn your head away from the camera |
| Keep longer hair behind shoulders, and tuck in any loose strands in front of the face | Block your chin or mouth with your microphone |
| Sit upright in a stable, seated position | Stand or shift positions during the video |
Head & Clothing Separation
There must be a clear visual distinction between your head and clothing, and your neck must be fully visible.
- No overlap between neck and clothing
- Avoid high collars or obstructive clothing
- Ensure the jawline and neck are fully visible
Hair Guidelines
- Avoid complex hairstyles
- No bangs covering the forehead
- Tuck or pin loose strands
- Longer hair must fall behind the shoulders
- Hair should not obscure the face, neck, or shoulders
Video Format
If you’re uploading a pre-recorded training video via our API, ensure it meets the following requirements:
- Minimum FPS: 25 fps
- Accepted formats: webm, mp4 with H.264 video codec and AAC audio codec
- Maximum file size: 750MB
- Minimum resolution: 1080p (lower may negatively impact replica quality)
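The requirements above can be checked before upload. The following is a minimal sketch, not an official Tavus tool: the `VideoInfo` type and `check_training_video` function are hypothetical names, the metadata is assumed to have been read already (for example with a tool such as ffprobe), "750MB" is assumed to mean decimal megabytes, "1080p" is assumed to mean at least 1080 pixels on the shorter side, and the codec requirement is assumed to apply to mp4 only.

```python
from dataclasses import dataclass

# Limits taken from the Video Format list; the unit interpretations are assumptions.
MAX_FILE_SIZE_BYTES = 750 * 1000 * 1000  # "750MB" read as decimal megabytes
MIN_FPS = 25.0
MIN_SHORT_SIDE = 1080  # "1080p" read as >= 1080 px on the shorter side


@dataclass
class VideoInfo:
    """Hypothetical container for metadata extracted from the video file."""
    container: str    # "mp4" or "webm"
    video_codec: str  # e.g. "h264"
    audio_codec: str  # e.g. "aac"
    width: int
    height: int
    fps: float
    size_bytes: int


def check_training_video(info: VideoInfo) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    container = info.container.lower()
    if container not in ("mp4", "webm"):
        problems.append(f"unsupported format: {info.container}")
    elif container == "mp4":
        # Assumption: the H.264/AAC requirement applies to mp4 files only.
        if info.video_codec.lower() != "h264":
            problems.append("mp4 video codec must be H.264")
        if info.audio_codec.lower() != "aac":
            problems.append("mp4 audio codec must be AAC")
    if info.fps < MIN_FPS:
        problems.append(f"frame rate {info.fps} is below {MIN_FPS} fps")
    if min(info.width, info.height) < MIN_SHORT_SIDE:
        problems.append("resolution is below 1080p")
    if info.size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file is larger than 750MB")
    return problems


good = VideoInfo("mp4", "h264", "aac", 1920, 1080, 30.0, 200_000_000)
bad = VideoInfo("mov", "h264", "aac", 1280, 720, 24.0, 800_000_000)
print(check_training_video(good))
print(check_training_video(bad))
```

Returning a list of problems rather than a single boolean lets you report every issue with a recording at once, instead of fixing and re-checking one requirement at a time.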
Consent Statement
If you’re creating a personal replica, you must include a verbal consent statement in the video. This ensures ethical use and compliance with data protection laws. Say the following script clearly in your video:
I, (your name), am currently speaking and give consent to Tavus to create an AI clone of me by using the audio and video samples I provide. I understand that this AI clone can be used to create videos that look and sound like me.
Consent is only required for personal replicas. If you’re creating an AI replica or using an AI-generated training video, you can skip this.
Recording Structure
Your video must be one continuous shot, containing 1 minute of speaking followed by 1 minute of listening. You can use a script provided by Tavus or speak on any topic of your choice.
Opening
- Begin with a big smile showing upper and lower teeth
- Maintain direct eye contact with the camera for approximately 1 second
Speaking Segment (1 Minute)
- Speak on any topic — content does not matter
- Open your mouth clearly when speaking
- Enunciate well, ensuring all teeth are fully visible
- Keep visible space between your top and bottom teeth
- Keep head and body movement minimal
- Avoid hand gestures
- Avoid sudden head turns

Replica training typically takes 4–5 hours. You can track the training progress by:
- Providing a callback_url when creating the replica via API
- Using the Get Replica Status API
- Checking the Developer Portal
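As an illustration of the first option, the sketch below assembles a JSON request body that includes callback_url. Only callback_url and model_name are named in this guide; the train_video_url field, the `build_replica_payload` helper, and the example URLs are assumptions made for illustration, so check the Create Replica API reference for the exact field names and endpoint before sending anything.

```python
import json
from typing import Optional


def build_replica_payload(
    train_video_url: str,
    callback_url: Optional[str] = None,
    model_name: Optional[str] = None,
) -> dict:
    """Assemble a request body for creating a replica (hypothetical helper).

    train_video_url is an assumed field name; callback_url and model_name
    are the names used in this guide.
    """
    payload = {"train_video_url": train_video_url}
    if callback_url is not None:
        # Training status updates would be sent to this URL.
        payload["callback_url"] = callback_url
    if model_name is not None:
        # Omit to use the default model.
        payload["model_name"] = model_name
    return payload


payload = build_replica_payload(
    "https://example.com/training-video.mp4",         # placeholder URL
    callback_url="https://example.com/hooks/replica",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

The sketch stops at building the body and does not send a request, so it runs without an API key. Since training typically takes several hours, a callback avoids having to poll for status in a loop.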
Full Body Replica
To create a full body replica for conversational video, follow these guidelines:
Framing & Orientation
- The subject must be captured from head to toe, with no extra space above or below.
- Record in vertical format (portrait mode) or crop appropriately to maintain vertical framing.
Posture & Movement
- Remain standing still throughout the recording.
- Avoid hand gestures or exaggerated body movements to maintain consistency and model quality.
Resolution & Quality
- A 4K resolution is recommended for best results.
- Ensure consistent lighting, with no shadows or sudden changes in exposure.
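The framing and resolution guidance for full body replicas can be expressed as a quick check on the frame dimensions. This is a rough sketch with a hypothetical function name; it assumes "4K" means 3840x2160, so in portrait mode the shorter side should be at least 2160 pixels. Because 4K is a recommendation rather than a requirement, the function returns notes instead of raising an error.

```python
def check_full_body_frame(width: int, height: int) -> list[str]:
    """Return notes about orientation and resolution; empty means no issues found."""
    notes = []
    if height <= width:
        # The guide asks for vertical (portrait) framing.
        notes.append("frame is not vertical: record in portrait mode or crop")
    if min(width, height) < 2160:
        # Assumption: 4K is 3840x2160, so the shorter side should be >= 2160 px.
        notes.append("below 4K: 4K resolution is recommended for best results")
    return notes


print(check_full_body_frame(2160, 3840))
print(check_full_body_frame(1920, 1080))
```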


