Protocol · TTS prosody × audio mastering → broadcast-grade narration

The Voice Clone Pipeline

Personal narration without recording yourself.

creators · podcasters · video producers

Artifact

```xml
<speak version="1.1" xml:lang="en-MY">
  <prosody rate="98%" pitch="-1st">
    There is a perfectly legal way to outsource this.
    <break time="280ms"/>
    Nobody tells you, because nobody profits from telling you.
  </prosody>
  <break time="600ms"/>
  <prosody rate="103%" pitch="-2st" volume="+1dB">
    Here is how it works.
  </prosody>
</speak>
```

The SSML opening of a 30-second narration. Rate, pitch, and break tags do the acting.
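
Hand-writing tags gets tedious past a few lines. Below is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`, of generating markup like the above from plain sentences. The `build_ssml` helper, the `SECTIONS` layout, and the default pause lengths are illustrative assumptions, not part of any TTS SDK.

```python
import xml.etree.ElementTree as ET

# Hypothetical input shape: one dict per section of the script.
SECTIONS = [
    {
        "rate": "98%",
        "pitch": "-1st",
        "sentences": [
            "There is a perfectly legal way to outsource this.",
            "Nobody tells you, because nobody profits from telling you.",
        ],
    },
    {
        "rate": "103%",
        "pitch": "-2st",
        "volume": "+1dB",
        "sentences": ["Here is how it works."],
    },
]


def build_ssml(sections, lang="en-MY", pause_ms=280, gap_ms=600):
    """Wrap each section in <prosody>, with breaks between sentences and sections."""
    speak = ET.Element("speak", {"version": "1.1", "xml:lang": lang})
    for index, section in enumerate(sections):
        attrs = {"rate": section["rate"], "pitch": section["pitch"]}
        if "volume" in section:
            attrs["volume"] = section["volume"]
        prosody = ET.SubElement(speak, "prosody", attrs)

        sentences = section["sentences"]
        prosody.text = sentences[0]
        for sentence in sentences[1:]:
            # Short pause between sentences inside the same section.
            pause = ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
            pause.tail = sentence

        # Longer pause between sections, but not after the last one.
        if index < len(sections) - 1:
            ET.SubElement(speak, "break", {"time": f"{gap_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(SECTIONS)
print(ssml)
```

Keeping rate and pitch in data rather than in markup makes a second take cheap: change one value, rebuild, re-render.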

What you need to know

SSML prosody markup
audio mastering (dynamic range, LUFS targeting)
phoneme-aligned captions (WebVTT / SRT)
ElevenLabs voice-clone consent ethics
the uncanny valley in prosody
right-of-publicity statutes
the difference between cloning your voice and faking someone else's

If a name is unfamiliar, that's the gap. The list is the curriculum.

The recipe

01
Record 90 seconds of your own clean speech. Read varied sentences (statements, questions, lists). Clone once.
02
Write the script in your written register. The clone preserves your typed cadence better than your spoken one.
03
Convert prose to SSML. Keep rate within 95-103% and shift pitch by 1-2 semitones at section breaks; break tags at commas make the rhythm sound human.
04
Render via ElevenLabs (turbo for drafts; v2 for finals). Generate two takes per line; pick the one that drops the AI tell.
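```python
from elevenlabs.client import ElevenLabs  # assumes ELEVENLABS_API_KEY is set in the environment
audio = ElevenLabs().text_to_speech.convert(voice_id, text=ssml, model_id='eleven_turbo_v2_5')
```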
05
Master to -16 LUFS for streaming, -19 LUFS for podcast. Compress lightly; the clone runs hot in the upper mids.
```bash
ffmpeg -i raw.wav -af 'loudnorm=I=-16:LRA=11:tp=-1.5' mastered.wav
```
06
Generate word-aligned captions from the rendered audio (Whisper with word_timestamps=True). Keep each caption line to 42 characters or fewer.
07
Mux into Remotion or final cut. Run one human listen-through; flag any line that uncanny-valleys and re-render that line only.
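
The caption step can be sketched in plain Python. This assumes you already have word timings as a list of dicts with `word`, `start`, and `end` keys, which is the shape openai-whisper is expected to return per segment when word timestamps are enabled; check your version. `words_to_cues`, `to_srt`, and the sample timings are hypothetical, and each cue here is a single caption line.

```python
# Hypothetical word timings, shaped like Whisper's per-word output.
WORDS = [
    {"word": " There", "start": 0.0, "end": 0.25},
    {"word": " is", "start": 0.3, "end": 0.45},
    {"word": " a", "start": 0.5, "end": 0.55},
    {"word": " perfectly", "start": 0.6, "end": 1.1},
    {"word": " legal", "start": 1.15, "end": 1.45},
    {"word": " way", "start": 1.5, "end": 1.75},
    {"word": " to", "start": 1.8, "end": 2.05},
    {"word": " outsource", "start": 2.1, "end": 2.6},
    {"word": " this.", "start": 2.65, "end": 2.9},
]


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_cues(words, max_chars=42):
    """Greedily pack words into cues of at most max_chars characters.

    A single word longer than max_chars still gets its own cue.
    """
    cues, tokens, start, end = [], [], None, None
    for w in words:
        token = w["word"].strip()
        if tokens and len(" ".join(tokens + [token])) > max_chars:
            cues.append((" ".join(tokens), start, end))
            tokens = []
        if not tokens:
            start = w["start"]
        tokens.append(token)
        end = w["end"]
    if tokens:
        cues.append((" ".join(tokens), start, end))
    return cues


def to_srt(cues):
    """Render (text, start, end) cues as an SRT document."""
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        for i, (text, start, end) in enumerate(cues, 1)
    ]
    return "\n\n".join(blocks) + "\n"


print(to_srt(words_to_cues(WORDS)))
```

Greedy packing breaks wherever the limit is hit, so a cue can end mid-phrase. Breaking at punctuation first would read better but needs more logic than this sketch carries.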

Receipt

Mind on Trial reel series: 21 episodes, ~14 minutes of narration, zero hours of voice recording after the initial clone.

Original voice clone took 12 minutes to capture. Subsequent episodes: write script, run SSML pass, render, master, caption, mux. ~25 minutes per episode end-to-end; the human is only present at the final listen-through.

Why it works

Most creators avoid talking-head video because of mic anxiety, schedule friction, or an imperfect voice. A trained clone removes all three constraints, which nothing else does, at the cost of one prerequisite: SSML literacy. The clone is only as alive as the markup; flat SSML produces flat audio. Treat the markup as the script and the script as the performance. Master at the right LUFS and the result reads as podcast-grade.
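
One way to catch flat markup before spending a render on it is a rough lint. This is a sketch under stated assumptions: the `prosody_profile` and `looks_flat` helpers and the thresholds are hypothetical heuristics, not an established measure, and the parser assumes un-namespaced tags as in the artifact above.

```python
import xml.etree.ElementTree as ET

FLAT = '<speak version="1.1" xml:lang="en-MY">Here is how it works.</speak>'

ACTED = (
    '<speak version="1.1" xml:lang="en-MY">'
    '<prosody rate="98%" pitch="-1st">'
    "There is a perfectly legal way to outsource this."
    '<break time="280ms"/>'
    "Nobody tells you, because nobody profits from telling you."
    "</prosody>"
    '<break time="600ms"/>'
    '<prosody rate="103%" pitch="-2st" volume="+1dB">Here is how it works.</prosody>'
    "</speak>"
)


def prosody_profile(ssml):
    """Count breaks and collect the distinct rate and pitch values used."""
    root = ET.fromstring(ssml)
    prosody = list(root.iter("prosody"))
    return {
        "breaks": sum(1 for _ in root.iter("break")),
        "rates": sorted({p.get("rate") for p in prosody if p.get("rate")}),
        "pitches": sorted({p.get("pitch") for p in prosody if p.get("pitch")}),
    }


def looks_flat(ssml):
    """Heuristic: no pauses at all, or no variation in either rate or pitch."""
    profile = prosody_profile(ssml)
    no_variation = len(profile["rates"]) < 2 and len(profile["pitches"]) < 2
    return profile["breaks"] == 0 or no_variation


print(looks_flat(FLAT), looks_flat(ACTED))
```

A passing lint only says the markup varies. Whether the variation sounds human is still decided at the listen-through.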

Adjacent

The Reel Factory↗
The protocol that built the reels you're watching.
