The Voice Clone Pipeline
Personal narration without recording yourself.
```xml
<speak version="1.1" xml:lang="en-MY">
  <prosody rate="98%" pitch="-1st">
    There is a perfectly legal way to outsource this.
    <break time="280ms"/>
    Nobody tells you, because nobody profits from telling you.
  </prosody>
  <break time="600ms"/>
  <prosody rate="103%" pitch="-2st" volume="+1dB">
    Here is how it works.
  </prosody>
</speak>
```
- SSML prosody markup
- audio mastering (dynamic range, LUFS targeting)
- phoneme-aligned captions (WebVTT / SRT)
- ElevenLabs voice-clone consent ethics
- the uncanny valley in prosody
- right-of-publicity statutes
- the difference between cloning your voice and faking someone else's
If a name is unfamiliar, that's the gap. The list is the curriculum.
- 01
Record 90 seconds of your own clean speech. Read varied sentences (statements, questions, lists). Clone once.
- 02
Write the script in your written register. The clone preserves your typed cadence better than your spoken one.
- 03
Convert prose to SSML. Keep the rate around 95-103% and shift pitch ±1-2 semitones at section breaks; short break tags before commas make the rhythm sound human.
- 04
Render via ElevenLabs (turbo for drafts; v2 for finals). Generate two takes per line; pick the one that drops the AI tell.
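```python
# `elevenlabs` is assumed to be an already-constructed ElevenLabs client; `voice_id` and `ssml` come from steps 01 and 03.
elevenlabs.text_to_speech.convert(voice_id, text=ssml, model_id='eleven_turbo_v2_5')
```
- 05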
Master to -16 LUFS for streaming, -19 LUFS for podcast. Compress lightly; the clone runs hot in the upper mids.
```bash
ffmpeg -i raw.wav -af 'loudnorm=I=-16:LRA=11:tp=-1.5' mastered.wav
```
- 06
Generate aligned captions from the rendered audio (Whisper with word_timestamps=True gives word-level timing). Trim to ≤42 chars/line.
- 07
Mux into Remotion or final cut. Run one human listen-through; flag any line that uncanny-valleys and re-render that line only.
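Step 06 above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes the word timings arrive as a list of dicts with `word`, `start`, and `end` keys (the shape Whisper is expected to return per segment when `word_timestamps=True`), and the helper names `fmt_ts` and `words_to_srt` are hypothetical.

```python
def fmt_ts(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_chars=42):
    """Group timed words into SRT cues of at most max_chars characters.

    words: list of dicts with 'word', 'start', 'end' (seconds); this input
    shape is an assumption. A single word longer than max_chars is kept
    whole on its own cue rather than split.
    """
    cues = []
    line, start, end = "", None, None
    for w in words:
        token = w["word"].strip()
        if not token:
            continue
        candidate = f"{line} {token}" if line else token
        if line and len(candidate) > max_chars:
            # Close the current cue at the previous word's end time.
            cues.append((start, end, line))
            line, start = token, w["start"]
        else:
            if not line:
                start = w["start"]
            line = candidate
        end = w["end"]
    if line:
        cues.append((start, end, line))
    return "\n".join(
        f"{i}\n{fmt_ts(s)} --> {fmt_ts(e)}\n{text}\n"
        for i, (s, e, text) in enumerate(cues, 1)
    )
```

Breaking only on word boundaries keeps every cue's start and end tied to real word timings, so a re-rendered line can be re-captioned without touching its neighbours.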
Most creators avoid talking-head video because of mic anxiety, schedule friction, or an imperfect voice. A trained clone removes all three constraints at the cost of one prerequisite that nothing else demands: SSML literacy. The clone is only as alive as the markup; flat SSML produces flat audio. Treat the markup as the script and the script as the performance. Master at the right LUFS and the result reads as podcast-grade.
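To make "treat the markup as the script" concrete, here is a rough sketch of the prose-to-SSML conversion from step 03. The function name `to_ssml`, the `(text, rate, pitch)` section tuples, and the default break lengths are all assumptions for illustration; only the tag names and attribute formats follow the SSML example earlier on this page.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(sections, lang="en-MY", section_break_ms=600, comma_break_ms=120):
    """Wrap prose sections in SSML prosody tags.

    sections: list of (text, rate_percent, pitch_semitones) tuples, e.g.
    ("Here is how it works.", 103, -2). This input shape is hypothetical.
    """
    parts = [f'<speak version="1.1" xml:lang="{lang}">']
    for i, (text, rate, pitch) in enumerate(sections):
        if i:
            # Longer pause between sections.
            parts.append(f'<break time="{section_break_ms}ms"/>')
        # Escape XML-special characters, then put a short break before each comma.
        body = escape(text).replace(",", f'<break time="{comma_break_ms}ms"/>,')
        parts.append(
            f'<prosody rate="{rate}%" pitch="{pitch:+d}st">{body}</prosody>'
        )
    parts.append("</speak>")
    ssml = "".join(parts)
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml
```

Keeping rate and pitch as per-section data means a flat-sounding take can be fixed by changing two numbers and re-rendering that section alone.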