You can record the Replica training video directly in the Developer Portal or upload a pre-recorded one via the API.
The following instructions have been updated to work best with Phoenix-4 (the new default model). Here are the key differences:
  • Listening minute must be fully neutral with lips closed the entire time
  • Neck and jawline must be fully visible with clear clothing separation and hair kept off the face and neck
  • Teeth must be clearly visible during speaking with strong articulation
  • Framing must be stable, waist-up, seated, with minimal movement
Phoenix-4 is a more precise model and requires high-quality training footage to yield the best results, whereas Phoenix-3 has a slightly higher tolerance for imperfect footage. To train on Phoenix-3, set model_name to phoenix-3.
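As a rough illustration of the model_name setting, the sketch below builds a request body that opts into Phoenix-3. Only the model_name field and the phoenix-3 value come from the text above. The train_video_url field name, the example URL, and the build_replica_payload helper are assumptions made for illustration, so check the API reference for the actual request shape.

```python
import json


def build_replica_payload(train_video_url, model_name=None):
    """Build a request body for creating a replica.

    `train_video_url` is an assumed field name used for illustration.
    When `model_name` is omitted, the field is left out so the service
    default (Phoenix-4, per the text above) would apply.
    """
    payload = {"train_video_url": train_video_url}
    if model_name is not None:
        payload["model_name"] = model_name
    return payload


# Hypothetical video URL; explicitly request Phoenix-3 training.
payload = build_replica_payload(
    "https://example.com/training.mp4", model_name="phoenix-3"
)
print(json.dumps(payload, indent=2))
```

Leaving model_name out of the payload, rather than sending an empty value, keeps the default-model behavior on the server side.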

Talking Head Replica

To ensure the highest quality Phoenix-4 replica, your training video must follow the specifications outlined below.

Environment

  • Record in a quiet, well-lit space with no background noise or movement.
  • Use diffuse lighting to avoid shadows on your face.
  • Choose a simple background and avoid any moving people or objects.

Camera

  • Place the camera at eye level and ensure your face fills at least 25% of the frame.
  • Use a desktop recording app (e.g., QuickTime on Mac or Camera on Windows) — avoid browser-based tools.
  • Minimum resolution: 1080p. Anything lower may negatively impact replica quality.
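The camera requirements above can be approximated with a quick arithmetic check. This is a minimal sketch under two assumptions that the text does not settle: "fills at least 25% of the frame" is read as face bounding-box area over frame area, and "1080p" is read as a shorter side of at least 1080 pixels. The face box is a hypothetical input that would come from whatever face detector you already use.

```python
def face_area_fraction(face_box, frame_size):
    """Fraction of the frame area covered by the face bounding box.

    face_box is (x, y, width, height) in pixels; frame_size is (width, height).
    """
    _, _, face_w, face_h = face_box
    frame_w, frame_h = frame_size
    return (face_w * face_h) / (frame_w * frame_h)


def check_camera_setup(face_box, frame_size, min_fraction=0.25, min_short_side=1080):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Assumption: "1080p" means the shorter side is at least 1080 px.
    if min(frame_size) < min_short_side:
        problems.append("resolution is below 1080p")
    # Assumption: the 25% rule is measured by area, not by height or width.
    if face_area_fraction(face_box, frame_size) < min_fraction:
        problems.append("face fills less than 25% of the frame")
    return problems


# A 960x600 face box in a 1920x1080 frame covers about 28% of the area.
print(check_camera_setup((480, 0, 960, 600), (1920, 1080)))
```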

Microphone

  • Use your device’s built-in microphone.
  • Avoid high-end mics or wireless earbuds like AirPods.
  • Turn off audio effects like noise suppression or EQ adjustments.

Framing & Distance

Your framing should resemble a natural Zoom-style call.
Positioning
  • Record from the waist up
  • Be seated at a desk or table
  • Position yourself at least 3 feet from the camera to avoid being too close to the lens
Camera Setup
  • Camera should be stable (no handheld movement)
  • Face centered in frame
  • Head, shoulders, and upper chest clearly visible

Yourself

✅ Do
  • Keep your full head visible, with a clear view of your face
  • Ensure your face and upper body are in sharp focus
  • If using a smartphone, keep the same framing and distance from the camera
  • Keep longer hair behind your shoulders, and tuck in any loose strands in front of your face
  • Sit upright in a stable, seated position
❌ Don’t
  • Wear clothes that blend into the background
  • Wear accessories like necklaces, hats, glasses, scarves, or earrings
  • Turn your head away from the camera
  • Block your chin or mouth with your microphone
  • Stand or shift positions during the video

Head & Clothing Separation

There must be a clear visual distinction between your head and clothing, and your neck must be fully visible.
  • No overlap between neck and clothing
  • Avoid high collars or obstructive clothing
  • Ensure the jawline and neck are fully visible

Hair Guidelines

  • Avoid complex hairstyles
  • No bangs covering the forehead
  • Tuck or pin loose strands
  • Longer hair must fall behind the shoulders
  • Hair should not obscure the face, neck, or shoulders

Video Format

If you’re uploading a pre-recorded training video via our API, ensure it meets the following requirements:
  • Minimum FPS: 25 fps
  • Accepted formats:
    • webm
    • mp4 with H.264 video codec and AAC audio codec
  • Maximum file size: 750MB
  • Minimum resolution: 1080p (lower may negatively impact replica quality)
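Before uploading, the requirements above can be pre-checked locally. The sketch below assumes ffprobe (part of FFmpeg) is installed and inspects its JSON output. It reads "750MB" as decimal megabytes and "1080p" as a shorter side of at least 1080 pixels, both of which are assumptions. The container test is coarse, because ffprobe reports the same format name for mp4 and mov, and for webm and mkv.

```python
import json
import subprocess
from fractions import Fraction

MAX_BYTES = 750 * 1000 * 1000  # assumption: "750MB" read as decimal megabytes
MIN_FPS = 25
MIN_SHORT_SIDE = 1080          # assumption: "1080p" means shorter side >= 1080 px


def probe(path):
    """Return ffprobe's JSON description of a file (needs ffprobe on PATH)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_format", "-show_streams",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def check_video(info):
    """Return a list of problems found in an ffprobe-style dict."""
    streams = info.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    fmt = info.get("format", {})
    names = fmt.get("format_name", "").split(",")
    if video is None:
        return ["no video stream found"]

    problems = []
    if "webm" in names:
        pass  # the guidelines list no codec constraint for webm
    elif "mp4" in names:
        if video.get("codec_name") != "h264":
            problems.append("mp4 video codec is not H.264")
        if audio is None or audio.get("codec_name") != "aac":
            problems.append("mp4 audio codec is not AAC")
    else:
        problems.append("container is not webm or mp4")

    # avg_frame_rate is a ratio string such as "30/1" or "30000/1001".
    try:
        fps = Fraction(video.get("avg_frame_rate", "0/1"))
    except (ZeroDivisionError, ValueError):
        fps = Fraction(0)
    if fps < MIN_FPS:
        problems.append("frame rate is below 25 fps")

    if int(fmt.get("size", 0)) > MAX_BYTES:
        problems.append("file is larger than 750MB")
    if min(video.get("width", 0), video.get("height", 0)) < MIN_SHORT_SIDE:
        problems.append("resolution is below 1080p")
    return problems


# Usage (requires a real file and ffprobe):
# print(check_video(probe("training.mp4")))
```

Keeping check_video separate from probe means the rules can be exercised on a plain dictionary without FFmpeg installed.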
If you’re creating a personal replica, you must include a verbal consent statement in the video. This ensures ethical use and compliance with data protection laws. Consent is not required for AI-generated training videos. Say the following script clearly in your video:
I, (your name), am currently speaking and give consent to Tavus to create an AI clone of me by using the audio and video samples I provide. I understand that this AI clone can be used to create videos that look and sound like me.
Consent is only required for personal replicas. If you’re creating an AI replica or using AI-generated training video, you can skip this.

Recording Structure

Your video must be one continuous shot, containing 1 minute of speaking followed by 1 minute of listening. You can use a script provided by Tavus or speak on any topic of your choice.
Pro tips:
  • Keep body and head movements subtle
  • Avoid heavy hand gestures
  • Only one person should appear in the video

1. Opening

  • Begin with a big smile showing upper and lower teeth
  • Maintain direct eye contact with the camera for approximately 1 second

2. Speaking Segment (1 Minute)

  • Speak on any topic — content does not matter
  • Open your mouth clearly when speaking
  • Enunciate well, ensuring all teeth are fully visible
  • Keep visible space between your top and bottom teeth
  • Keep head and body movement minimal
  • Avoid hand gestures
  • Avoid sudden head turns
Sample script (optional):
Once upon a time, people built a perfect park in the middle of a busy city. This park was big, bright, and full of playful paths. At sunrise, birds sang above the tall trees. Families carried baskets packed with bread, fruit, and juice.

Children skipped and shouted, chasing balls and flying paper kites. In the afternoon, people played games. Some tapped paddles and bounced plastic balls. Others kicked soccer balls back and forth, laughing loudly with every point scored.

As the day went on, friends gathered for friendly competition. Some threw footballs through the warm air, while others tossed frisbees across the open grass, cheering with every perfect catch. At sunset, the park grew quiet again. People packed up their bags and said goodbye. The golden sky made the grass glow, and soft breezes moved through the leaves.

Today, parks are still places where people gather to play, to talk, and to breathe fresh air. From simple paths to shining playgrounds, parks bring peace, play, and plenty of happy moments. Places like that remain alive with voices, faces, and feelings, promising joy again tomorrow.

3. Listening Segment (1 Minute)

  • Transition naturally into a listening posture
  • Keep lips neutral and closed throughout
  • Maintain a steady head position
  • Avoid exaggerated expressions
  • Do not lick lips or form unusual mouth shapes
  • An occasional closed-lip smile is recommended
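To keep the segments on schedule while recording in one continuous shot, a simple cue sheet can help. The sketch below lays out the three steps back to back. It assumes the roughly one-second opening comes before the speaking minute rather than counting toward it, which the guidelines do not state explicitly. The labels and helper names are illustrative.

```python
# (label, duration in seconds), in recording order.
SEGMENTS = [
    ("Opening: big smile, eye contact", 1),
    ("Speaking: clear articulation, teeth visible", 60),
    ("Listening: lips closed, neutral expression", 60),
]


def cue_sheet(segments):
    """Turn (label, seconds) pairs into (start, end, label) cues."""
    cues = []
    start = 0
    for label, seconds in segments:
        cues.append((start, start + seconds, label))
        start += seconds
    return cues


def as_clock(total_seconds):
    """Format a second count as m:ss."""
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


for start, end, label in cue_sheet(SEGMENTS):
    print(f"{as_clock(start)}-{as_clock(end)}  {label}")
```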
Replica training typically takes 4–5 hours. You can track the training progress by:

High-Quality Training Example

Full Body Replica

To create a full body replica for conversational video, follow these guidelines:

Framing & Orientation

  • The subject must be captured from head to toe, with no extra space above or below.
  • Record in vertical format (portrait mode) or crop appropriately to maintain vertical framing.

Posture & Movement

  • Remain standing still throughout the recording.
  • Avoid hand gestures or exaggerated body movements to maintain consistency and model quality.

Resolution & Quality

  • 4K resolution is recommended for best results.
  • Ensure consistent lighting, with no shadows or sudden changes in exposure.
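The orientation and resolution guidance for full body footage can be checked from the frame dimensions alone. This is a small sketch that assumes "4K" means a long side of at least 3840 pixels (2160x3840 in portrait). Because 4K is a recommendation rather than a requirement, the result is a list of notes, not a pass or fail verdict.

```python
def check_full_body_frame(width, height, recommended_long_side=3840):
    """Return notes about orientation and resolution for full body footage."""
    notes = []
    # Portrait mode means the frame is taller than it is wide.
    if height <= width:
        notes.append("frame is not vertical; record in portrait or crop")
    # Assumption: "4K" is read as a long side of at least 3840 px.
    if max(width, height) < recommended_long_side:
        notes.append("resolution is below the recommended 4K")
    return notes


print(check_full_body_frame(2160, 3840))  # portrait 4K
print(check_full_body_frame(1920, 1080))  # landscape 1080p
```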