There's something uncanny about hearing a machine speak in a voice indistinguishable from a human's. Not the robotic monotone of old GPS units, but the real thing — the breath, the hesitation, the warmth. In the last two years, text-to-speech has crossed from "good enough" to "wait, that's not a person?" This tutorial walks through how that works: from the physics of sound, through neural audio codecs, to the three architectures pushing the state of the art today.
We'll build the intuition layer by layer. By the end, you'll understand why a 100M-parameter model can clone your voice from 5 seconds of audio, why an 82M model can run on your phone, and why a 1.6B model can carry on a conversation that sounds genuinely alive.
0. Foundations: Audio
Before we can build a speech synthesizer, we need to understand what sound is to a computer. This is the equivalent of understanding pixels before you write a shader.
The waveform
Sound is pressure variation in air. A microphone converts that into a waveform — a one-dimensional signal measuring amplitude over time. At 24,000 Hz sample rate (the standard for modern TTS), one second of speech is a sequence of 24,000 floating-point numbers:
Each sᵢ ∈ [−1.0, 1.0] — air pressure at time i/24000 seconds
This is incredibly high-resolution. A 10-second utterance is 240,000 numbers. Predicting each one autoregressively — "given the first 23,999 samples, what's sample 24,000?" — is computationally suicidal. The sequences are too long, the dependencies too fine-grained. No language model can attend over 240k tokens efficiently.
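To make the scale concrete, here's a minimal NumPy sketch of what "audio as numbers" means at 24 kHz. The sine tone is purely illustrative — real speech is far messier, but the sample counts are the same:

```python
import numpy as np

SAMPLE_RATE = 24_000  # samples per second, standard for modern TTS

# One second of a 440 Hz tone: 24,000 floating-point numbers in [-1, 1]
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

# A 10-second utterance: the number of values an autoregressive
# model would have to predict one at a time
ten_seconds = 10 * SAMPLE_RATE  # 240,000
```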
Early neural TTS systems like WaveNet (2016) actually did predict one sample at a time. It worked beautifully but took minutes to generate one second of audio. The entire modern TTS revolution is about avoiding this bottleneck.
The spectrogram
Humans don't hear individual samples — we hear frequencies. The Short-Time Fourier Transform (STFT) converts a waveform into a 2D representation: time on one axis, frequency on the other. Each cell contains the energy at that frequency during that time window:
X(m, k) = Σₙ x(m·hop + n) · w(n) · e^(−i·2πkn/N)
w(n) = window function (Hann), hop = step size, N = FFT size
The spectrogram is a massive dimensionality reduction. Instead of 24,000 samples per second, we might get 100 time frames per second, each with 1,025 frequency bins. Still large, but manageable.
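Those numbers are easy to verify with a bare-bones NumPy STFT (a toy sketch — real systems use `librosa` or `torch.stft`). With N = 2048 we get N/2 + 1 = 1,025 frequency bins, and a hop of 240 samples at 24 kHz gives ~100 frames per second:

```python
import numpy as np

def stft_magnitude(x, n_fft=2048, hop=240):
    # Slide a Hann-windowed frame across the signal, FFT each frame
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([
        x[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=-1))  # [frames, 1025]

x = np.random.default_rng(0).normal(size=24_000)  # 1 sec at 24 kHz
S = stft_magnitude(x)
# S.shape == (92, 1025): ~100 frames/sec (a few fewer because
# we drop the partial frames at the edges), 1,025 bins each
```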
The mel spectrogram
Human hearing is logarithmic in frequency. We can distinguish between 200 Hz and 400 Hz easily, but 8,000 Hz and 8,200 Hz sound nearly identical. The mel scale warps the frequency axis to match human perception, and we apply triangular filter banks to compress 1,025 frequency bins down to 80 or 128 mel bins:
80 mel bins × 100 frames/sec = 8,000 values per second
This is 3× fewer values than the raw waveform. For years, mel spectrograms were the lingua franca of TTS — the model predicted a mel spectrogram from text, then a separate vocoder (like Griffin-Lim, WaveGlow, or HiFi-GAN) turned it back into a waveform.
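The warping itself is one formula — the HTK-style conversion mel(f) = 2595·log₁₀(1 + f/700). A quick check shows why 200 vs 400 Hz is perceptually "far" while 8,000 vs 8,200 Hz is "close":

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale: roughly linear below ~700 Hz,
    # logarithmic above
    return 2595 * math.log10(1 + f / 700)

low_gap = hz_to_mel(400) - hz_to_mel(200)     # ~226 mels — easily heard
high_gap = hz_to_mel(8200) - hz_to_mel(8000)  # ~26 mels — nearly identical
```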
The mel spectrogram throws away phase information — it only keeps magnitude. This is why you need a neural vocoder to reconstruct audio; you can't simply invert the mel spectrogram. The vocoder has to "hallucinate" plausible phase, which is why older TTS often sounded slightly metallic.
The fundamental problem
Even mel spectrograms are too verbose for language models. 8,000 continuous values per second, over sequences of 10–30 seconds, means context windows of 80,000–240,000. Modern LLMs work best with discrete tokens and context lengths under 8,000. We need a way to compress speech into a very short sequence of tokens — maybe 12.5 per second. Enter: neural audio codecs.
1. The Codec
The neural audio codec is the breakthrough that made modern TTS possible. It's an autoencoder — a neural network that compresses audio to a tiny representation and reconstructs it back. But unlike JPEG or MP3, the compression happens in a learned latent space, and the representation can be either discrete tokens or continuous vectors.
Encoder → Bottleneck → Decoder
The architecture is conceptually simple. An encoder compresses raw audio into a sequence of latent vectors. A decoder reconstructs audio from those vectors. The bottleneck between them is where the magic happens — it forces the network to learn a compact, meaningful representation of sound:
The encoder uses strided convolutions — each layer reduces the time dimension by a factor (typical strides of [8, 5, 4, 2] give 320× total downsampling, i.e. 75 frames/sec at 24 kHz; low-frame-rate codecs like Mimi downsample further, to 1920× total). At 1920×, that's 24,000 samples/sec ÷ 1920 = 12.5 latent frames per second. Each frame is a vector of 128 or 256 dimensions — a compressed "snapshot" of ~80ms of audio.
Vector quantization
Here's the key insight: language models predict discrete tokens — integers from a vocabulary, like word IDs. Raw latent vectors are continuous (infinite possible values). To make audio compatible with language models, we need to quantize — snap each continuous vector to the nearest entry in a learned codebook:
quantize(z) = argmin_i ‖z − cᵢ‖²
The index i becomes the discrete token. The codebook entry cᵢ is the reconstructed vector.
This is exactly like color quantization in image processing — you have a palette of 1,024 "colors" (codebook vectors), and each pixel (latent frame) gets mapped to the closest palette entry. The codebook is learned during training via the commitment loss:
```python
# Vector Quantization forward pass
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    # Find nearest codebook entry for each latent vector
    # z: [batch, seq, dim], codebook: [codebook_size, dim]
    distances = ((z.unsqueeze(-2) - codebook) ** 2).sum(-1)  # [batch, seq, codebook_size]
    indices = distances.argmin(dim=-1)  # [batch, seq] — the discrete tokens!
    z_q = codebook[indices]             # quantized vectors

    # Losses to train the codebook (VQ-VAE naming)
    codebook_loss = F.mse_loss(z.detach(), z_q)    # move codebook toward encoder output
    commitment_loss = F.mse_loss(z, z_q.detach())  # move encoder output toward codebook

    # Straight-through estimator: forward uses z_q, backward uses z
    z_q = z + (z_q - z).detach()
    return z_q, indices, codebook_loss + 0.25 * commitment_loss
```
The straight-through estimator is a wonderful hack. argmin is non-differentiable — there's no gradient through "find the nearest codebook entry." So during the backward pass, we pretend quantization didn't happen and pass the gradient straight from the decoder to the encoder. It shouldn't work, but it does.
Residual Vector Quantization (RVQ)
A single codebook with 1,024 entries can only represent 10 bits of information per frame. That's roughly MP3-at-8kbps quality — barely intelligible. We need more bits, but simply making the codebook huge (e.g., 2²⁰ entries) makes lookup intractable.
RVQ solves this brilliantly: use multiple small codebooks in sequence. The first codebook captures the coarse shape of the audio. The second codebook quantizes the residual — the error left after the first. The third quantizes the error of the error. And so on:
Level 1: q₁ = quantize(z) → captures coarse shape
Level 2: q₂ = quantize(z − q₁) → captures mid detail
Level 3: q₃ = quantize(z − q₁ − q₂) → captures fine detail
...
Level K: qₖ = quantize(z − Σᵢ₌₁ᵏ⁻¹ qᵢ) → captures residual detail
Reconstructed: ẑ = q₁ + q₂ + q₃ + … + qₖ
With 8 levels of 1,024-entry codebooks, we get 8 × 10 = 80 bits per frame. At 12.5 frames/second, that's 1,000 bits/second = 1 kbps. For comparison, MP3 uses 128 kbps. Neural codecs achieve comparable quality at 1/128th the bitrate.
```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, num_levels=8, codebook_size=1024, dim=128):
        super().__init__()
        self.levels = nn.ModuleList([
            VectorQuantizer(codebook_size, dim)
            for _ in range(num_levels)
        ])

    def forward(self, z):
        all_indices = []
        total_loss = 0.0
        residual = z
        quantized_sum = torch.zeros_like(z)
        for level in self.levels:
            q, indices, loss = level(residual)
            quantized_sum = quantized_sum + q
            residual = z - quantized_sum  # what's left to explain
            all_indices.append(indices)
            total_loss = total_loss + loss
        return quantized_sum, all_indices, total_loss  # 8 token streams!
```
This creates a natural hierarchy. The first RVQ level captures semantic content — what is being said, the rough melody of speech. Later levels capture acoustic detail — the speaker's timbre, room reverb, microphone characteristics. This split is crucial: it means a language model can predict the "meaning" of speech (level 1) separately from its "texture" (levels 2–8).
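The defining property of the cascade — each level shrinks the error left by the previous ones — shows up even in a toy NumPy version with random codebooks (purely illustrative; trained codebooks do far better):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z, codebook):
    # Snap each vector to its nearest codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

dim, n_vecs, codebook_size, n_levels = 8, 64, 256, 4
z = rng.normal(size=(n_vecs, dim))

residual = z
reconstruction = np.zeros_like(z)
errors = []
for _ in range(n_levels):
    # each level's random codebook is scaled to the residual it must explain
    codebook = rng.normal(size=(codebook_size, dim)) * residual.std()
    reconstruction += quantize(residual, codebook)
    residual = z - reconstruction
    errors.append(np.abs(residual).mean())
# errors shrinks level by level: each codebook explains part of
# what the previous levels left behind
```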
The codec zoo: EnCodec, SoundStream, Mimi
Three major codecs dominate the TTS landscape:
| Codec | From | Frame rate | RVQ levels | Architecture | Key trick |
|---|---|---|---|---|---|
| EnCodec | Meta, 2022 | 75 Hz | 8 | CNN + LSTM | Multi-scale discriminator |
| SoundStream | Google, 2021 | 50 Hz | 12 | CNN | Residual VQ (original) |
| Mimi | Kyutai, 2024 | 12.5 Hz | 8 | CNN + Transformer | Semantic distillation |
Mimi is the most interesting. At 12.5 Hz (one frame per 80ms), it produces far fewer tokens than EnCodec's 75 Hz. That's 6× fewer tokens for the language model to predict. Mimi achieves this extreme compression by adding a Transformer layer inside the codec and by distilling semantic knowledge from WavLM (a speech understanding model) into the first RVQ level:
```python
# Mimi's semantic distillation loss (sketch)
# Forces the first RVQ level to encode "what is being said"
# rather than acoustic details
wavlm_features = wavlm_encoder(audio)         # semantic targets
codec_features = codec.encoder(audio)         # learned features
first_level = rvq_level_1(codec_features)     # first RVQ level only
semantic_loss = 1.0 - cosine_similarity(
    first_level, wavlm_features
)  # push first level to encode meaning, not acoustics
```
The semantic distillation is a beautiful design choice. By forcing the first RVQ level to encode "meaning" (phonemes, prosody) and letting later levels handle "texture" (timbre, reverb), the language model only needs to nail the semantic tokens. A simpler decoder model can handle the acoustic refinement. This is exactly the split that CSM exploits with its dual-transformer design.
Going continuous: Pocket TTS's radical choice
Pocket TTS breaks from all of the above. Instead of discrete tokens, it predicts continuous latent vectors directly. The codec is modified to output Gaussian-distributed latents (like a VAE) instead of quantized codes:
Pocket TTS: audio → encoder → z ~ 𝒩(μ, σ²) → LM → continuous ẑ → decoder → audio
No codebook. No quantization. No information loss at the bottleneck.
Why? Because quantization is a bottleneck. Each codebook entry is a lossy approximation. When you shrink the model (fewer parameters), the language model struggles to predict tokens precisely — a wrong token in level 1 cascades into garbage audio. Continuous latents avoid this entirely. The model just needs to get "close enough" to the right vector, and the decoder smoothly interpolates.
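A tiny NumPy experiment (illustrative, not from Pocket TTS) makes the contrast concrete. Sampling one wrong discrete token jumps to an unrelated codebook vector; a continuous prediction that is merely "close enough" degrades gracefully:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(1024, 128))  # 1,024 entries, 128 dims
z_true = codebook[0]                     # the "right answer" latent

# Discrete failure mode: the LM samples one wrong token —
# reconstruction lands on an unrelated codebook vector
wrong_token_err = np.abs(codebook[1] - z_true).mean()

# Continuous failure mode: the LM is off by a small amount —
# the error stays small
close_pred = z_true + 0.05 * rng.normal(size=128)
continuous_err = np.abs(close_pred - z_true).mean()
# wrong_token_err is large; continuous_err is tiny
```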
The trade-off: you can't use standard cross-entropy loss (which requires discrete targets). Instead, Pocket TTS uses a flow matching loss with Lagrangian Self-Distillation (LSD), which we'll cover in Section 5.
2. The Language Model
Once we have a compact audio representation — whether discrete tokens or continuous latents — we need a model that can generate them from text. This is where the "language" in "language model" becomes literal: we're treating speech generation as a next-token prediction problem, just like GPT predicts the next word.
Autoregressive generation
The core loop is identical to text generation. Given a sequence of past audio tokens, predict the next one:
```python
# Simplified autoregressive TTS generation
def generate_speech(text, model, codec, max_frames=500):
    text_tokens = tokenize(text)  # "Hello" → [4, 829, 1305]
    audio_tokens = []
    for step in range(max_frames):
        # Concatenate text + generated audio so far
        input_seq = text_tokens + audio_tokens
        # Predict next audio token
        logits = model(input_seq)    # [vocab_size] scores
        next_token = sample(logits)  # temperature sampling
        audio_tokens.append(next_token)
        if next_token == EOS:
            break
    # Decode tokens back to audio
    waveform = codec.decode(audio_tokens)  # 500 tokens → 40 sec at 12.5 Hz
    return waveform
```
At 12.5 frames/second, a 10-second utterance is only 125 tokens — well within the attention window of any modern transformer. Compare that to 240,000 raw samples or 1,000 mel spectrogram frames. The codec is what made this tractable.
The multi-stream problem
With RVQ, we don't have one token per frame — we have K tokens (one per RVQ level). A frame at 12.5 Hz with 8 RVQ levels means 100 tokens per second. Naively flattening this to one long sequence works but is wasteful — level 8 (subtle acoustic detail) doesn't need the full power of a billion-parameter transformer.
Three strategies have emerged:
Delayed patterns (SoundStorm, CSM)
Offset each RVQ level by one timestep. At each position, the model sees level 1 of the current frame, level 2 of the previous frame, level 3 of the frame before that, etc. This allows all levels to be predicted in parallel while maintaining causal ordering.
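The pattern is easy to see in code. A hedged sketch (token values invented for illustration): shift level k right by k−1 steps, padding with a placeholder, so one transformer position stacks level 1 of frame t, level 2 of frame t−1, and so on:

```python
import numpy as np

PAD = -1  # placeholder for positions before a stream has started

def delay_pattern(codes):
    """codes: [K levels, T frames] → delayed: [K, T + K - 1]."""
    K, T = codes.shape
    delayed = np.full((K, T + K - 1), PAD)
    for k in range(K):
        delayed[k, k : k + T] = codes[k]  # level k shifted right by k
    return delayed

codes = np.arange(12).reshape(3, 4)  # 3 RVQ levels, 4 frames
d = delay_pattern(codes)
# column t now stacks level 1 of frame t, level 2 of frame t-1, ...
```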
Dual transformer (CSM)
A large backbone transformer handles text + first RVQ level (semantics). A smaller depth decoder predicts remaining levels (acoustics). The backbone runs once per frame; the decoder runs K times per frame but is much cheaper:
```python
# CSM's dual-transformer approach (sketch)
# Backbone: Llama-1B — sees text + semantic tokens
backbone_out = backbone(text_tokens + semantic_audio_tokens)

# Depth decoder: small transformer — predicts acoustic tokens
acoustic_tokens = []
for level in range(1, 8):  # RVQ levels 2–8
    token = depth_decoder(
        backbone_out[-1],   # conditioning from backbone
        acoustic_tokens,    # tokens from the levels predicted so far
    )
    acoustic_tokens.append(token)
```
Single-step continuous (Pocket TTS)
Predict the entire continuous vector in one shot — no RVQ levels, no multi-step decoding. An MLP head takes the transformer's output and generates the full latent vector in a single forward pass.
Text conditioning
The model needs to know what to say. Different systems handle this differently:
- Phoneme input (Kokoro): Text is first converted to IPA phonemes — "hello" → [h ə l oʊ]. This removes ambiguity (is "read" past or present tense?) and makes the modeling task easier. The trade-off: you need a handcrafted grapheme-to-phoneme (G2P) pipeline.
- Subword input (CSM, Pocket TTS): Standard SentencePiece or BPE tokenizer, same as text LLMs. The model learns pronunciation implicitly. More flexible (handles names, code, foreign words) but requires more data.
3. Three Architectures
Now we have the building blocks. Let's see how three models assemble them into complete systems — each making fundamentally different design choices.
Kokoro: the efficient specialist
Kokoro (82M parameters) isn't a language model at all — it's a non-autoregressive system derived from StyleTTS 2. It predicts the entire mel spectrogram at once, then converts to audio with a specialized vocoder. No codecs, no tokens, no autoregressive loop.
Voicepacks instead of voice cloning
StyleTTS 2 uses a diffusion model to sample speaker styles at inference time — powerful but slow. Kokoro replaces this with pre-computed voicepacks: 512-dimensional style vectors extracted from reference audio and baked into files. At inference, you just load a voicepack — no iterative diffusion, no reference audio processing:
```python
import torch

# Kokoro voice conditioning — voicepack is a frozen 512-d style vector
voicepack = torch.load("voices/af_heart.pt")  # [512]

# The style vector is injected into every layer via AdaIN
# (Adaptive Instance Normalization — same as in style transfer)
def ada_in(content, style):
    mean, std = content.mean(), content.std()
    style_mean, style_std = style_mlp(style)  # learned projection of the voicepack
    return style_std * (content - mean) / std + style_mean
```
Voicepacks are the key to Kokoro's speed. A diffusion-based style model might take 50 denoising steps per utterance. A voicepack is a single matrix multiply. The trade-off: you can't clone arbitrary voices at inference time — someone has to create the voicepack first. This is why the Pocket TTS blog post calls Kokoro's voice set "fixed."
iSTFTNet: math instead of neural upsampling
Traditional vocoders (HiFi-GAN) use learned transposed convolutions to upsample from mel spectrogram to waveform. Each upsampling layer doubles or quadruples the time resolution, adding parameters and compute.
Kokoro's iSTFTNet takes a shortcut: the network only needs to predict magnitude and phase at a lower resolution. Then the inverse STFT — a purely mathematical operation, no learnable parameters — reconstructs the full waveform:
HiFi-GAN: mel → [upsample] → [upsample] → [upsample] → [upsample] → waveform
(each upsample = transposed conv + activation + residual block)
iSTFTNet: mel → [lightweight net] → magnitude + phase → iSTFT → waveform
(iSTFT is a fixed matrix multiply, not learned)
The result: fewer parameters, less compute, and often better phase coherence because the iSTFT enforces physical consistency that neural upsampling doesn't.
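That "physical consistency" can be checked in a few lines of NumPy (a toy, not the Kokoro implementation). With a periodic Hann window at 50% overlap, analysis windows sum to exactly 1, so plain overlap-add inverts the transform in the signal's interior — no learned parameters anywhere:

```python
import numpy as np

N, HOP = 512, 256  # 50% overlap
# periodic Hann: w[k] + w[k + N/2] == 1, so overlap-add reconstructs exactly
window = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

def stft(x):
    frames = [x[i : i + N] * window for i in range(0, len(x) - N + 1, HOP)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, length):
    y = np.zeros(length)
    for m, frame_spec in enumerate(spec):
        y[m * HOP : m * HOP + N] += np.fft.irfft(frame_spec, n=N)
    return y

x = np.random.default_rng(2).normal(size=4096)
y = istft(stft(x), len(x))
# interior samples match the original to machine precision
```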
CSM: the conversational model
CSM (Conversational Speech Model, 1.66B parameters) from Sesame AI Labs is fundamentally different. It's an autoregressive transformer that generates discrete audio tokens — essentially a text LLM repurposed for speech. Its unique strength is conversational context: it models multi-turn dialogue, adapting tone and emotion based on what was said before.
The dual transformer
CSM's backbone is Llama 3.2 1B — literally the same architecture as Meta's text LLM, trained from scratch on interleaved text and audio tokens. It processes both modalities in a unified sequence:
```python
# CSM input sequence for a 2-turn conversation
input_sequence = [
    # Turn 1: user's speech (text + audio tokens interleaved)
    TEXT_START, *user_text_tokens, TEXT_END,
    AUDIO_START, *user_audio_tokens_level1, AUDIO_END,
    # Turn 2: assistant's response (text given, audio generated)
    TEXT_START, *response_text_tokens, TEXT_END,
    AUDIO_START,  # model generates from here →
]
# Backbone predicts semantic tokens (RVQ level 1)
# Then depth decoder fills in acoustic detail (levels 2–8)
```
The key innovation: the backbone sees entire conversation history. If the previous turn was angry, the model learns to generate a cautious, conciliatory response. If the previous turn was a joke, the response might have a lighter cadence. This is what Sesame calls "voice presence" — the model doesn't just read text aloud; it performs it.
Compute amortization
Training the depth decoder on every frame of every RVQ level would be extremely expensive. CSM's trick: randomly sample 1/16th of frames for depth decoder training. The backbone still sees every frame (it needs full context), but the acoustic decoder only trains on a random subset. This cuts memory by ~6× with negligible quality loss.
Pocket TTS: the continuous rebel
Pocket TTS (100M parameters) from Kyutai is the newest and most radical. It throws away discrete tokens entirely, predicting continuous latent vectors with a single-step flow model. The result: a model 16× smaller than CSM that runs faster than real-time on a laptop CPU.
Why continuous works
Pocket TTS was distilled from Kyutai TTS 1.6B — a full-size model with RVQ tokens. The insight: when you try to make a discrete-token model smaller, the RQ-transformer (which predicts multiple RVQ levels) becomes the bottleneck. It's hard to shrink without losing quality. But if you remove quantization entirely, the model just needs to predict a single vector per frame — one forward pass through a small MLP:
```python
import torch
import torch.nn as nn

# Pocket TTS: one MLP head replaces the entire RQ-transformer
class LSDHead(nn.Module):
    """Single-step flow matching head via Lagrangian Self-Distillation."""
    def __init__(self, hidden_dim=512, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim + 1, 512),  # z + noise + t
            nn.GELU(),
            nn.Linear(512, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),  # predict clean latent
        )

    def forward(self, transformer_output, noisy_latent, timestep):
        # Predict the clean latent in ONE step —
        # no iterative denoising, no diffusion chain
        x = torch.cat([transformer_output, noisy_latent, timestep], dim=-1)
        return self.net(x)  # → clean continuous latent
```
Model size breakdown
| Component | Params | Role |
|---|---|---|
| Causal transformer | 84M | Context modeling, text understanding |
| LSD head (MLP) | 6M | Single-step latent generation |
| VAE decoder | 10M | Latent → waveform |
| VAE encoder | 18M | Voice sample → prefix (inference only) |
| Total (generation) | 100M | Encoder not needed after first use |
For comparison: CSM's backbone alone is 1B parameters. Pocket TTS achieves comparable quality in 1/10th the parameters by (a) removing quantization overhead, (b) using a 6-layer transformer (distilled from 24), and (c) replacing multi-step decoding with a single MLP forward pass.
4. Voice Cloning & Conditioning
Making a model speak is one thing. Making it speak in your voice is another. This is the dimension where the three architectures diverge most sharply.
The voice identity problem
A voice has dozens of characteristics: pitch range, formant frequencies, speaking rate, breathiness, nasality, accent, vocal fry, the way consonants are released. Capturing all of this in a way a model can use requires either (a) a compact embedding that encodes these traits, or (b) a raw audio prefix that the model learns to imitate.
Strategy 1: Fixed voicepacks (Kokoro)
Kokoro uses pre-computed style vectors — 512 floating-point numbers that encode everything about a voice. These are extracted by running reference audio through a style encoder during training, then saved as .pt files. The vector is frozen at inference time: no voice cloning, only voices that have been pre-extracted.
Voicepacks are injected via Adaptive Instance Normalization (AdaIN) at every layer. This is the same technique used in neural style transfer — the content (phoneme sequence) passes through the network, and at each layer, the mean and variance of the activations are shifted to match the target style:
AdaIN(x) = σ_style · (x − μ(x)) / σ(x) + μ_style, where μ_style, σ_style = MLP(voicepack)
Advantages: Fast, deterministic, no reference audio needed at inference. Limitations: Can't clone an unseen voice without re-extracting the style vector (which requires the style encoder and isn't part of the public model).
Strategy 2: Conversational context (CSM)
CSM doesn't have explicit speaker embeddings. Instead, voice identity is carried by the audio token history. When you provide audio context — previous turns of a conversation — the model attends to those tokens and implicitly extracts the speaker's characteristics:
```python
# CSM "voice cloning" via context:
# provide a reference utterance as conversation history
context = [
    Segment(
        speaker=0,
        text="This is my natural speaking voice.",
        audio=reference_audio,  # 5-10 seconds of target voice
    )
]

# Generate in the same voice
output = csm.generate(
    text="Now I'll say something new.",
    speaker=0,        # same speaker ID as context
    context=context,  # model attends to these tokens
)
```
This is in-context learning — the same mechanism that lets GPT-4 follow instructions without fine-tuning. The model has seen millions of conversations during training and learned to match speaking style within a conversation. The quality depends heavily on the reference audio: more context = better voice matching.
Strategy 3: Audio prefix (Pocket TTS)
Pocket TTS uses the most direct approach: encode the reference audio with the codec encoder, and prepend it to the generation sequence. The model simply continues generating audio that sounds like the prefix:
```python
# Pocket TTS voice cloning
reference_latents = codec.encoder(reference_audio)  # [~62 frames for 5 sec]
text_tokens = tokenizer(text)                       # [~20 tokens]

# Input: [voice_latents..., text_tokens...]
# The model generates audio latents that continue the voice
generated_latents = model.generate(
    prefix=reference_latents,  # voice identity
    text=text_tokens,          # what to say
    temperature=0.7,           # Gaussian temperature
    cfg_alpha=1.5,             # classifier-free guidance
)
audio = codec.decode(generated_latents)
```
Because the voice conditioning lives in the same continuous space as the generation target, the model naturally captures everything about the reference: voice color, emotion, accent, cadence, room acoustics, microphone characteristics. Kyutai reports that 5 seconds of reference audio is sufficient for faithful voice reproduction.
Pocket TTS's approach is elegant: the codec encoder (18M params) only runs once per voice sample. After that, the voice embedding can be cached and reused for unlimited generations. The generation path (100M params) never needs to see the raw reference audio again.
Strategy 4: Fine-tuning with LoRA
All three models can be adapted to a specific voice through fine-tuning, but the most practical approach is LoRA (Low-Rank Adaptation). Instead of updating all parameters, LoRA adds small trainable matrices to the attention layers:
LoRA: y = Wx + BAx (B ∈ ℝᵈˣʳ, A ∈ ℝʳˣᵈ, r ≪ d)
With r=32 and d=2048: W has ~4.2M params, BA has 131K params (≈3.1%)
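The parameter math is easy to verify with a plain NumPy mock-up of a single LoRA-augmented linear layer (dimensions from the example above; `scale = lora_alpha / r` follows the usual LoRA convention, and B is initialized to zero so the adapter starts as a no-op):

```python
import numpy as np

d, r = 2048, 32
rng = np.random.default_rng(3)

W = rng.normal(size=(d, d)) * 0.02  # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.02  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (zero-init)
scale = 32 / r                      # lora_alpha / r

def lora_linear(x):
    # Frozen path plus low-rank trainable correction
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d)
y = lora_linear(x)  # identical to W @ x until B is trained

frozen_params = W.size              # 4,194,304
trainable_params = A.size + B.size  # 131,072 — about 3% of W
```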
```python
# LoRA config for CSM fine-tuning (from csm-finetune report)
from peft import LoraConfig

config = LoraConfig(
    r=32,           # rank — controls capacity vs efficiency
    lora_alpha=32,  # scaling factor (usually = r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # FFN
    ],
    lora_dropout=0.0,
)
# Result: 29M trainable params out of 1.66B total (1.75%)
# Peak VRAM: 6.2 GB on A100 (vs ~40 GB for full fine-tuning)
# Adapter size: ~100 MB (vs multi-GB for full weights)
```
LoRA fine-tuning for TTS is remarkably data-efficient. For CSM, the Elise voice dataset contains only 1,195 examples. Training for 60 steps with batch size 8 takes ~15 minutes on an A100. The resulting voice is noticeably different from the base model — the speaker identity shifts toward the target while the model retains its ability to speak naturally.
5. Training: Data, Compute & Tricks
You have the architecture. Now: how do you actually train one of these things? The data requirements, loss functions, and training tricks are where the real craft lies.
How much data?
The data requirements vary by over three orders of magnitude:
| Model | Training data | Hours | Cost estimate |
|---|---|---|---|
| Kokoro v0.19 | LibriVox + synthetic | <100 hours | ~$400–$1,000 (A100s) |
| Kokoro v1.0 | LibriVox + Apache + synthetic | ~200–300 hours | ~$1,000–$2,000 |
| Pocket TTS | AMI, GigaSpeech, Emilia, etc. | 88,000 hours | Not disclosed |
| CSM | Public transcribed English | ~1,000,000 hours | Not disclosed (H100s) |
| LoRA fine-tune | Single speaker dataset | 1–10 hours | ~$5–$50 (A100 spot) |
The pattern: architectural efficiency trades against data requirements. Kokoro uses phonemes and voicepacks to pre-structure the problem, so it needs less data to learn the remaining mapping. CSM uses raw text tokens and learns everything end-to-end — more flexible, but it needs 10,000× more data.
Pocket TTS at 88,000 hours is interesting — it's purely public datasets, deliberately chosen for reproducibility. Kyutai notes they're "excited to see how far the method can be pushed with additional private data." The 88k hours is likely far below the model's capacity.
Dataset composition matters
It's not just about hours — it's about diversity. Pocket TTS uses 8 different datasets specifically to cover different domains:
- AMI — meeting recordings (natural conversation, overlapping speech)
- EARNINGS22 — earnings calls (formal speech, financial terminology)
- GigaSpeech — podcasts and YouTube (casual, varied acoustics)
- SPGISpeech — financial transcripts (clear enunciation)
- TED-LIUM — TED talks (public speaking, emotional range)
- VoxPopuli — EU Parliament (accented English)
- LibriHeavy — audiobooks (expressive reading)
- Emilia — large-scale diverse speech (2024, multi-domain)
A model trained only on audiobooks would sound great reading poetry but collapse on conversational speech. A model trained only on phone calls would lack expressiveness. The mix is critical.
Loss functions for TTS
Different architectures use different losses, but they share a common theme: reconstruct audio well, while also fooling a discriminator.
Codec training losses
```python
# Training a neural audio codec (EnCodec, Mimi, etc.)
L_total = (
    # 1. Reconstruction: spectrogram should match
    L_mel         # L1 loss on mel spectrogram (perceptual)
    + L_stft      # multi-scale STFT loss (time-frequency)
    # 2. Adversarial: fool the discriminator
    + L_adv       # GAN loss (make discriminator think output is real)
    + L_feat      # feature matching loss (match discriminator internals)
    # 3. Quantization: stable codebooks
    + L_commit    # commitment loss (codebook ↔ encoder alignment)
    # 4. Semantic (Mimi only): first level should encode meaning
    + L_semantic  # cosine similarity with WavLM features
)
```
Language model losses
For discrete-token models (CSM), the loss is just cross-entropy — the same as GPT:
L = −Σₜ log p(xₜ | x₍<t₎)
Exactly the same as language model pre-training.
For continuous-latent models (Pocket TTS), the loss is flow matching + LSD:
Lagrangian Self-Distillation (LSD)
Standard flow matching trains a model to predict the velocity field of a diffusion process — moving from noise to clean audio over many steps. The problem: you need 50–100 steps at inference time. LSD makes the model predict the clean output in one step by adding a consistency constraint:
Standard flow matching:
v_θ(x_t, t) predicts direction from noisy x_t toward clean x₀
Inference: iterate 50+ steps from x₁ (noise) to x₀ (audio)
LSD addition:
Also train a "shortcut" that goes directly from x_t to x₀
The shortcut must agree with the iterative path (self-consistency)
Inference: ONE step from noise to audio
The "Lagrangian" part: instead of hard-constraining the consistency (which makes optimization unstable), a Lagrange multiplier is introduced that automatically balances flow matching accuracy against single-step shortcut quality. The result: Pocket TTS generates one latent vector per forward pass — no iterative refinement at all.
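To see why a single step can suffice in principle, here's a toy NumPy version of the straight-line flow underlying this setup — not Pocket TTS's training code. Along the path x_t = t·noise + (1−t)·clean, the true velocity is constant, so one Euler step of size 1 from pure noise lands exactly where 100 small steps do:

```python
import numpy as np

rng = np.random.default_rng(4)
clean = rng.normal(size=128)  # x0: stand-in for a clean audio latent
noise = rng.normal(size=128)  # x1: the Gaussian sample generation starts from

def x_t(t):
    # straight-line path: x1 (noise) at t=1, x0 (clean) at t=0
    return t * noise + (1 - t) * clean

velocity = noise - clean  # d(x_t)/dt — constant along a straight path

# Many small Euler steps from t=1 down to t=0...
x, dt = x_t(1.0), 1.0 / 100
for _ in range(100):
    x = x - velocity * dt

# ...reach the same point as one big step of size 1
one_step = x_t(1.0) - velocity * 1.0
```

The catch, of course, is that a trained model only approximates this velocity field, and real paths aren't perfectly straight — which is exactly the gap the self-consistency constraint closes.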
Classifier-Free Guidance (CFG)
CFG is borrowed from image generation (Stable Diffusion uses it heavily). The idea: during training, randomly drop the conditioning (text/voice) some fraction of the time. At inference, run the model twice — once with conditioning, once without — and amplify the difference:
α = 1.0: no guidance (use conditioned output as-is)
α = 1.5: moderate guidance (Pocket TTS default)
α = 3.0: strong guidance (clearer but less natural)
α = 7.0+: over-guided (robotic, distorted)
```python
# CFG during inference
def generate_with_cfg(model, text, voice, alpha=1.5):
    # Forward pass WITH conditioning
    z_cond = model(text=text, voice=voice)
    # Forward pass WITHOUT conditioning (text and voice masked)
    z_uncond = model(text=None, voice=None)
    # Extrapolate away from unconditioned toward conditioned
    z_guided = z_uncond + alpha * (z_cond - z_uncond)
    return z_guided
```
The cost: 2× inference compute (two forward passes). Pocket TTS solves this with distillation — train a student model that produces CFG-quality output in a single pass (see below).
CFG is surprisingly subtle in the audio domain. For images, guidance pushes toward more "typical" outputs (sharper, more saturated). For speech, it pushes toward clearer pronunciation and more distinctive speaker characteristics — but too much makes the voice sound hyperarticulated and unnatural, like a news anchor having a breakdown.
Pocket TTS's Latent CFG
Standard CFG operates on the model's output — the predicted audio token or waveform. But Pocket TTS predicts continuous latents via a one-step flow model, and interpolating in output space for one-step models doesn't work (you'd just be averaging two waveforms, which is layering sounds on top of each other).
Pocket TTS's innovation: apply CFG in the transformer's latent space — on the hidden states z, not the generated audio x:
z_cfg = z_uncond + α · (z_cond − z_uncond) ← guidance applied to hidden states
x = LSD_head(z_cfg) ← generate audio from guided latent
The guided latent z_cfg might be out-of-distribution for the LSD head (it was trained on normal latents, not extrapolated ones). Surprisingly, it still works — the LSD head is robust enough to handle the shifted input, and the quality improvement is significant.
Head Batch Multiplier
Training with flow matching is bottlenecked by the transformer backbone — it's much larger than the MLP head. Pocket TTS's trick: reuse each backbone output 8 times:
```python
# Head Batch Multiplier (N=8)
z = transformer(input_sequence)  # expensive! run once per batch

total_loss = 0
for i in range(8):  # reuse z eight times
    noise = torch.randn_like(target_latent)  # fresh noise each time
    t = torch.rand(1)                        # fresh timestep
    predicted = lsd_head(z, noise, t)
    total_loss += flow_matching_loss(predicted, target_latent)
loss = total_loss / 8  # amortized over 8 samples
```
Each of the 8 iterations uses different random noise and timesteps, so the LSD head sees diverse training signal from a single backbone pass. This is 8× more efficient and also stabilizes training by averaging over multiple noise samples.
Distillation: making it tiny
Pocket TTS's final trick: distill a 24-layer teacher into a 6-layer student. The teacher runs with CFG (α=1.5), producing high-quality guided latents. The student is trained to match those guided latents without running two forward passes:
Teacher (24 layers, CFG): z_teacher = z_uncond + α · (z_cond − z_uncond)
Student (6 layers):       z_student = student_backbone(input)
L_distill = ‖z_student − z_teacher‖²
Result: the student produces CFG-quality output in a single pass, with
4× fewer layers — and, since the second unconditioned pass is gone,
roughly 8× less backbone compute.
The student also gets a frozen copy of the teacher's MLP head. Only the backbone is re-trained. This means the student backbone must learn to output latents in the same space as the teacher's CFG-guided latents — a different distribution than what the MLP head was trained on, but close enough to work.
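One training step of this setup might look like the following sketch; `teacher_backbone` and `student_backbone` are hypothetical callables (the teacher frozen, the student trainable), not Pocket TTS's actual interfaces:

```python
import torch
import torch.nn.functional as F

def distill_step(student_backbone, teacher_backbone, cond, uncond, alpha=1.5):
    """One distillation step (sketch): the student learns to match the
    teacher's CFG-guided latents from a single conditioned pass."""
    with torch.no_grad():  # teacher is frozen
        z_cond = teacher_backbone(cond)
        z_uncond = teacher_backbone(uncond)
        # CFG-guided target latent (two teacher passes)
        z_teacher = z_uncond + alpha * (z_cond - z_uncond)
    z_student = student_backbone(cond)       # one pass, no CFG
    return F.mse_loss(z_student, z_teacher)  # ‖z_student − z_teacher‖²
```

In a real run the loss would be backpropagated through `student_backbone` only; the frozen MLP head never appears in this loss at all, which is exactly why the student must land in the teacher's guided-latent distribution.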
Gaussian Temperature Sampling
Temperature sampling is standard for discrete language models — divide logits by a temperature T before softmax to control randomness. But Pocket TTS generates continuous vectors, so there are no logits.
The continuous equivalent: scale the noise variance. The LSD head takes Gaussian noise as input and produces a latent vector. Scaling the noise standard deviation to √τ (where τ < 1 is the temperature) produces less diverse but higher-quality outputs:
With temperature τ = 0.7: noise ~ 𝒩(0, τ), i.e. std = √0.7 ≈ 0.837
Lower temperature → noise closer to zero → output closer to the mode
→ cleaner, more predictable speech (at the cost of variety)
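A quick numerical check of the scaling (the function name is illustrative, not from Pocket TTS):

```python
import torch

def temperature_noise(shape, tau=0.7):
    # Continuous temperature sampling: variance tau, i.e. std = sqrt(tau).
    # tau = 1.0 recovers standard sampling; tau -> 0 collapses to the mode.
    return torch.randn(shape) * tau ** 0.5

torch.manual_seed(0)
noise = temperature_noise((200_000,), tau=0.7)
print(f"std ≈ {noise.std().item():.3f}")  # close to sqrt(0.7) ≈ 0.837
```

Feeding this scaled noise to the LSD head in place of standard Gaussian noise is the whole trick — no logits, no softmax.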
6. Putting It All Together
The three philosophies
| | Kokoro | CSM | Pocket TTS |
|---|---|---|---|
| Philosophy | Pre-structure everything | Learn everything end-to-end | Remove the bottleneck |
| Parameters | 82M | 1,660M | 100M |
| Audio repr. | Mel spectrogram | Discrete tokens (Mimi RVQ) | Continuous latents (VAE) |
| Text input | Phonemes (handcrafted G2P) | Subword tokens (learned) | Subword tokens (learned) |
| Generation | Non-autoregressive | Autoregressive (dual transformer) | Autoregressive (single step) |
| Voice cloning | Voicepacks (fixed set) | In-context (conversation history) | Audio prefix (5 sec) |
| Vocoder | iSTFTNet | Mimi decoder | VAE decoder |
| Training data | ~200 hours | ~1M hours | 88K hours |
| CPU real-time | ✓ | ✗ | ✓ |
| WER | 1.93 | — | 1.84 |
The evolution of TTS pipelines
The trend is clear: from explicit pipelines to learned everything, from big models to efficient ones, from discrete tokens to continuous representations. Five years ago, TTS required three separate models (text-to-phones, phones-to-mel, mel-to-wave). Today, a single 100M-parameter network goes from text to waveform.
What to try if you're getting started
- Run Kokoro locally — `pip install kokoro`. It's tiny, fast, and great for understanding the phoneme → audio pipeline. Swap voicepacks to hear different speakers.
- Run Pocket TTS — `uvx pocket-tts serve`. Record yourself for 5 seconds and hear the model clone your voice. The wow factor of real-time CPU voice cloning is hard to beat.
- Fine-tune all three architectures — the tts-training directory has Modal scripts for CSM-1B, F5-TTS, and Orpheus-3B, all training on the same dataset. Run them, then use `compare_modal.py` to hear the same sentences spoken by each model side by side. Nothing teaches architecture differences like hearing them.
- Read the Mimi codec tutorial — Kyutai's codec explainer is excellent for building intuition about how audio gets compressed to tokens.
- Experiment with temperature and CFG — on any model, try pushing temperature to 0.3 (robotic clarity) and 1.5 (chaotic naturalness). Try CFG at 1.0 (no guidance) and 5.0 (aggressive). The sweet spot is where the model sounds most human.
Key lessons
- The codec is the foundation — every modern TTS system depends on compressing audio to a representation the language model can handle. The quality ceiling is set by the codec, not the LM. A perfect language model with a mediocre codec will sound mediocre.
- Discrete vs. continuous is the central debate — discrete tokens let you use standard LM machinery (cross-entropy, sampling, beam search). Continuous latents avoid quantization loss but require new loss functions (flow matching, LSD) and new sampling strategies (Gaussian temperature).
- Pre-structuring trades data for quality at small scale — Kokoro's phonemes and voicepacks compensate for having 5,000× less data than CSM. If you're training with limited data, phonemes and pre-computed speaker embeddings are your friends.
- CFG is the quality knob — it works for discrete and continuous models, for images and audio, for conditioned and unconditioned generation. Understanding CFG deeply is transferable to any generative domain.
- Distillation is the endgame — train a huge model with unlimited compute, then distill into a tiny model for deployment. Pocket TTS went from 24 layers to 6, from GPU-required to CPU-real-time, with minimal quality loss. This is the playbook for efficient AI in general.
- LoRA makes fine-tuning accessible — 1,195 examples, 60 training steps, 15 minutes on one A100, ~$5. That's the barrier to entry for creating a custom TTS voice. The gap between "reading a paper" and "running it" has never been smaller.
- Voice cloning is a spectrum — from Kokoro's frozen voicepacks (deterministic, fast, limited) through CSM's in-context learning (flexible, expensive, context-dependent) to Pocket TTS's audio prefix (direct, lightweight, faithful). The right approach depends on your use case.
Dev notes
This walkthrough covers Kokoro v0.19/v1.0 (82M params, HuggingFace), Sesame CSM (1.66B params, GitHub), Kyutai Pocket TTS (100M params, GitHub), and Orpheus-3B (GitHub). Key papers: Continuous Audio Language Models (Kyutai, 2025), DSM / Delayed Streams Modeling (Kyutai, 2025), F5-TTS (2024), StyleTTS 2 (2023). Companion training scripts: CSM, F5-TTS, and Orpheus fine-tuning on Modal with side-by-side comparison.
A voice speaks, and the air remembers.