Protocol · TTS prosody × audio mastering → broadcast-grade narration

The Voice Clone Pipeline

Personal narration without recording yourself.

creators · podcasters · video producers
Artifact
xml
<speak version="1.1" xml:lang="en-MY">
  <prosody rate="98%" pitch="-1st">
    There is a perfectly legal way to outsource this.
    <break time="280ms"/>
    Nobody tells you, because nobody profits from telling you.
  </prosody>
  <break time="600ms"/>
  <prosody rate="103%" pitch="-2st" volume="+1dB">
    Here is how it works.
  </prosody>
</speak>
The SSML opening of a 30-second narration. Rate, pitch, and break tags do the acting.
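
Markup like this can also be generated from structured data rather than typed by hand. The following is a minimal sketch using only Python's standard library; the function name `build_ssml` and the tuple layout of `OPENING` are illustrative assumptions, not part of the original pipeline. It rebuilds the same opening and keeps the output well-formed by construction.

```python
import xml.etree.ElementTree as ET

# Namespace URI behind the reserved "xml:" prefix (used for xml:lang).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(segments, lang="en-MY"):
    """Build a <speak> document from (prosody_attrs, parts, pause_after_ms) tuples.

    Each item in parts is either a string of narration or an int, which
    becomes a <break> of that many milliseconds inside the <prosody> element.
    """
    speak = ET.Element("speak", {"version": "1.1", f"{{{XML_NS}}}lang": lang})
    for attrs, parts, pause_after_ms in segments:
        prosody = ET.SubElement(speak, "prosody", attrs)
        last_break = None
        for part in parts:
            if isinstance(part, int):
                last_break = ET.SubElement(prosody, "break", {"time": f"{part}ms"})
            elif last_break is None:
                # Text before the first <break> belongs to the element itself.
                prosody.text = (prosody.text or "") + part
            else:
                # Text after a <break> is stored as that element's tail.
                last_break.tail = (last_break.tail or "") + part
        if pause_after_ms:
            ET.SubElement(speak, "break", {"time": f"{pause_after_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


# Hypothetical data layout mirroring the artifact above.
OPENING = [
    (
        {"rate": "98%", "pitch": "-1st"},
        [
            "There is a perfectly legal way to outsource this. ",
            280,
            " Nobody tells you, because nobody profits from telling you.",
        ],
        600,
    ),
    ({"rate": "103%", "pitch": "-2st", "volume": "+1dB"}, ["Here is how it works."], 0),
]

ssml = build_ssml(OPENING)
print(ssml)
```

Keeping the script as data makes it easy to nudge one rate or pause value and re-render a single line without touching the rest of the markup.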
What you need to know
  • SSML prosody markup
  • audio mastering (dynamic range, LUFS targeting)
  • phoneme-aligned captions (WebVTT / SRT)
  • ElevenLabs voice-clone consent ethics
  • the uncanny valley in prosody
  • right-of-publicity statutes
  • the difference between cloning your voice and faking someone else's

If a name is unfamiliar, that's the gap. The list is the curriculum.

The recipe
  1. 01

    Record 90 seconds of your own clean speech. Read varied sentences (statements, questions, lists). Clone once.

  2. 02

    Write the script in your written register. The clone preserves your typed cadence better than your spoken one.

  3. 03

    Convert prose to SSML. Rate around 95-103%; pitch ±1-2 semitones at section breaks; break tags before commas turn the rhythm human.

  4. 04

    Render via ElevenLabs (turbo for drafts; v2 for finals). Generate two takes per line; pick the one that drops the AI tell.

    python
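    from elevenlabs.client import ElevenLabs  # convert() lives on a client instance, not on the module
    audio = ElevenLabs(api_key=api_key).text_to_speech.convert(voice_id, text=ssml, model_id='eleven_turbo_v2_5')  # yields audio byte chunks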
  5. 05

    Master to -16 LUFS for streaming, -19 LUFS for podcast. Compress lightly; the clone runs hot in the upper mids.

    bash
    # loudnorm upsamples internally in single-pass mode; -ar sets the output back to 48 kHz
    ffmpeg -i raw.wav -af 'loudnorm=I=-16:LRA=11:TP=-1.5' -ar 48000 mastered.wav
  6. 06

    Generate time-aligned captions from the rendered audio (Whisper with word_timestamps=True, which returns word-level timings). Trim to <=42 chars/line.

  7. 07

    Mux into Remotion or Final Cut. Run one human listen-through; flag any line that falls into the uncanny valley and re-render that line only.
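
The caption step can be sketched in plain Python. This is a minimal illustration, assuming word timings arrive as dicts with `word`, `start`, and `end` keys (the shape Whisper uses for word-level output); the helper names `wrap_words`, `srt_time`, and `to_srt`, and the sample timings, are hypothetical. It groups words into cues of at most 42 characters and writes them as SRT.

```python
def wrap_words(words, max_chars=42):
    """Group timed words into caption cues of at most max_chars characters.

    A single word longer than max_chars still gets its own cue.
    """
    cues, current = [], []
    for w in words:
        text = w["word"].strip()  # Whisper words usually carry a leading space
        if not text:
            continue
        candidate = " ".join([c["word"] for c in current] + [text])
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = []
        current.append({**w, "word": text})
    if current:
        cues.append(current)
    return [
        {
            "start": cue[0]["start"],
            "end": cue[-1]["end"],
            "text": " ".join(c["word"] for c in cue),
        }
        for cue in cues
    ]


def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(cues):
    """Serialize cues as numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{i}\n{srt_time(c['start'])} --> {srt_time(c['end'])}\n{c['text']}"
        for i, c in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


# Hypothetical word timings for the first sentence of the narration.
SAMPLE_WORDS = [
    {"word": " There", "start": 0.0, "end": 0.3},
    {"word": " is", "start": 0.3, "end": 0.45},
    {"word": " a", "start": 0.45, "end": 0.55},
    {"word": " perfectly", "start": 0.55, "end": 1.1},
    {"word": " legal", "start": 1.1, "end": 1.5},
    {"word": " way", "start": 1.5, "end": 1.75},
    {"word": " to", "start": 1.75, "end": 1.9},
    {"word": " outsource", "start": 1.9, "end": 2.6},
    {"word": " this.", "start": 2.6, "end": 3.0},
]

cues = wrap_words(SAMPLE_WORDS)
print(to_srt(cues))
```

Breaking on word boundaries keeps each cue's start and end tied to real word timings, so a re-rendered line only needs its own cues regenerated.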

Receipt
Mind on Trial reel series: 21 episodes, ~14 minutes of narration, zero hours of voice recording after the initial clone.
Original voice clone took 12 minutes to capture. Subsequent episodes: write script, run SSML pass, render, master, caption, mux. ~25 minutes per episode end-to-end; the human is only present at the final listen-through.
Why it works

Most creators avoid talking-head video because of mic anxiety, schedule friction, or an imperfect voice. A trained clone removes all three constraints, at the cost of one prerequisite that nothing else demands: SSML literacy. The clone is only as alive as the markup; flat SSML produces flat audio. Treat the markup as the script and the script as the performance. Master at the right LUFS and the result reads as podcast-grade.
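
The LUFS targets in step 05 come down to decibel arithmetic, since one loudness unit equals one decibel. Here is a rough sketch of the static gain a render would need to reach each target, and whether that gain alone would push the true peak past the -1.5 dBTP ceiling. The measured values are hypothetical and the function names are illustrative; in practice the measurement would come from a loudness meter or an ffmpeg loudnorm analysis pass.

```python
def static_gain(measured_lufs, target_lufs):
    """Return (gain in dB, linear amplitude factor) to move measured to target."""
    gain_db = target_lufs - measured_lufs
    return gain_db, 10 ** (gain_db / 20)


def needs_limiting(true_peak_dbtp, gain_db, ceiling_dbtp=-1.5):
    """True if a plain gain change would push the true peak above the ceiling."""
    return true_peak_dbtp + gain_db > ceiling_dbtp


# Hypothetical measurements for a raw render (not taken from the article).
MEASURED_LUFS = -21.0
TRUE_PEAK_DBTP = -4.0

for target in (-16.0, -19.0):
    gain_db, factor = static_gain(MEASURED_LUFS, target)
    limited = needs_limiting(TRUE_PEAK_DBTP, gain_db)
    print(f"target {target} LUFS: gain {gain_db:+.1f} dB, x{factor:.3f}, limiter needed: {limited}")
```

With these assumed numbers, reaching -16 LUFS takes +5 dB and would lift the peak to +1 dBTP, so a limiter or light compression has to absorb the difference; reaching -19 LUFS takes only +2 dB and stays under the ceiling.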
