There's something uncanny about hearing a machine speak in a voice indistinguishable from a human's. Not the robotic monotone of old GPS units, but the real thing — the breath, the hesitation, the warmth. In the last two years, text-to-speech has crossed from "good enough" to "wait, that's not a person?" This tutorial walks through how that works: from the physics of sound, through neural audio codecs, to the three architectures pushing the state of the art today.
We'll build the intuition layer by layer. By the end, you'll understand why a 100M-parameter model can clone your voice from 5 seconds of audio, why an 82M model can run on your phone, and why a 1.6B model can carry on a conversation that sounds genuinely alive.
0. Foundations: Audio
Before we can build a speech synthesizer, we need to understand what sound is to a computer. This is the equivalent of understanding pixels before you write a shader.
The waveform
Sound is pressure variation in air. A microphone converts that into a waveform — a one-dimensional signal measuring amplitude over time. At 24,000 Hz sample rate (the standard for modern TTS), one second of speech is a sequence of 24,000 floating-point numbers:
Each sᵢ ∈ [−1.0, 1.0] — air pressure at time i/24000 seconds
This is incredibly high-resolution. A 10-second utterance is 240,000 numbers. Predicting each one autoregressively — "given the first 23,999 samples, what's sample 24,000?" — is computationally suicidal. The sequences are too long, the dependencies too fine-grained. No language model can attend over 240k tokens efficiently.
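To make the scale concrete, here's a minimal NumPy sketch of what "audio as numbers" means at 24 kHz. The sine tone is purely illustrative — real speech is far messier, but the sample counts are the same:

```python
import numpy as np

SAMPLE_RATE = 24_000  # samples per second, standard for modern TTS

# One second of a 440 Hz tone: 24,000 floating-point numbers in [-1, 1]
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

# A 10-second utterance: the number of values an autoregressive
# model would have to predict one at a time
ten_seconds = 10 * SAMPLE_RATE  # 240,000
```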
Early neural TTS systems like WaveNet (2016) actually did predict one sample at a time. It worked beautifully but took minutes to generate one second of audio. The entire modern TTS revolution is about avoiding this bottleneck.
The spectrogram
Humans don't hear individual samples — we hear frequencies. The Short-Time Fourier Transform (STFT) converts a waveform into a 2D representation: time on one axis, frequency on the other. Each cell contains the energy at that frequency during that time window:
X(m, k) = Σₙ x(m·hop + n) · w(n) · e^(−i·2πkn/N)
w(n) = window function (Hann), hop = step size, N = FFT size
The spectrogram is a massive dimensionality reduction. Instead of 24,000 samples per second, we might get 100 time frames per second, each with 1,025 frequency bins. Still large, but manageable.
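Those numbers are easy to verify with a bare-bones NumPy STFT (a toy sketch — real systems use `librosa` or `torch.stft`). With N = 2048 we get N/2 + 1 = 1,025 frequency bins, and a hop of 240 samples at 24 kHz gives ~100 frames per second:

```python
import numpy as np

def stft_magnitude(x, n_fft=2048, hop=240):
    # Slide a Hann-windowed frame across the signal, FFT each frame
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([
        x[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=-1))  # [frames, 1025]

x = np.random.default_rng(0).normal(size=24_000)  # 1 sec at 24 kHz
S = stft_magnitude(x)
# S.shape == (92, 1025): ~100 frames/sec (a few fewer because
# we drop the partial frames at the edges), 1,025 bins each
```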
The mel spectrogram
Human hearing is logarithmic in frequency. We can distinguish between 200 Hz and 400 Hz easily, but 8,000 Hz and 8,200 Hz sound nearly identical. The mel scale warps the frequency axis to match human perception, and we apply triangular filter banks to compress 1,025 frequency bins down to 80 or 128 mel bins:
80 mel bins × 100 frames/sec = 8,000 values per second
This is 3× fewer values than the raw waveform. For years, mel spectrograms were the lingua franca of TTS — the model predicted a mel spectrogram from text, then a separate vocoder (like Griffin-Lim, WaveGlow, or HiFi-GAN) turned it back into a waveform.
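The warping itself is one formula — the HTK-style conversion mel(f) = 2595·log₁₀(1 + f/700). A quick check shows why 200 vs 400 Hz is perceptually "far" while 8,000 vs 8,200 Hz is "close":

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale: roughly linear below ~700 Hz,
    # logarithmic above
    return 2595 * math.log10(1 + f / 700)

low_gap = hz_to_mel(400) - hz_to_mel(200)     # ~226 mels — easily heard
high_gap = hz_to_mel(8200) - hz_to_mel(8000)  # ~26 mels — nearly identical
```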
The mel spectrogram throws away phase information — it only keeps magnitude. This is why you need a neural vocoder to reconstruct audio; you can't simply invert the mel spectrogram. The vocoder has to "hallucinate" plausible phase, which is why older TTS often sounded slightly metallic.
The fundamental problem
Even mel spectrograms are too verbose for language models. 8,000 continuous values per second, over sequences of 10–30 seconds, means context windows of 80,000–240,000. Modern LLMs work best with discrete tokens and context lengths under 8,000. We need a way to compress speech into a very short sequence of tokens — maybe 12.5 per second. Enter: neural audio codecs.
1. The Codec
The neural audio codec is the breakthrough that made modern TTS possible. It's an autoencoder — a neural network that compresses audio to a tiny representation and reconstructs it back. But unlike JPEG or MP3, the compression happens in a learned latent space, and the representation can be either discrete tokens or continuous vectors.
Encoder → Bottleneck → Decoder
The architecture is conceptually simple. An encoder compresses raw audio into a sequence of latent vectors. A decoder reconstructs audio from those vectors. The bottleneck between them is where the magic happens — it forces the network to learn a compact, meaningful representation of sound:
The encoder uses strided convolutions — each layer reduces the time dimension by a factor (typical strides of [8, 5, 4, 2] give 320× total downsampling, i.e. 75 frames/sec at 24 kHz; low-frame-rate codecs like Mimi downsample further, to 1920× total). At 1920×, that's 24,000 samples/sec ÷ 1920 = 12.5 latent frames per second. Each frame is a vector of 128 or 256 dimensions — a compressed "snapshot" of ~80ms of audio.
Vector quantization
Here's the key insight: language models predict discrete tokens — integers from a vocabulary, like word IDs. Raw latent vectors are continuous (infinite possible values). To make audio compatible with language models, we need to quantize — snap each continuous vector to the nearest entry in a learned codebook:
quantize(z) = argmin_i ‖z − cᵢ‖²
The index i becomes the discrete token. The codebook entry cᵢ is the reconstructed vector.
This is exactly like color quantization in image processing — you have a palette of 1,024 "colors" (codebook vectors), and each pixel (latent frame) gets mapped to the closest palette entry. The codebook is learned during training via the commitment loss:
```python
# Vector Quantization forward pass
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    # Find nearest codebook entry for each latent vector
    # z: [batch, seq, dim], codebook: [codebook_size, dim]
    distances = ((z.unsqueeze(-2) - codebook) ** 2).sum(-1)  # [batch, seq, codebook_size]
    indices = distances.argmin(dim=-1)  # [batch, seq] — the discrete tokens!
    z_q = codebook[indices]             # quantized vectors

    # Losses to train the codebook (VQ-VAE naming)
    codebook_loss = F.mse_loss(z.detach(), z_q)    # move codebook toward encoder output
    commitment_loss = F.mse_loss(z, z_q.detach())  # move encoder output toward codebook

    # Straight-through estimator: forward uses z_q, backward uses z
    z_q = z + (z_q - z).detach()
    return z_q, indices, codebook_loss + 0.25 * commitment_loss
```
The straight-through estimator is a wonderful hack. argmin is non-differentiable — there's no gradient through "find the nearest codebook entry." So during the backward pass, we pretend quantization didn't happen and pass the gradient straight from the decoder to the encoder. It shouldn't work, but it does.
Residual Vector Quantization (RVQ)
A single codebook with 1,024 entries can only represent 10 bits of information per frame. That's roughly MP3-at-8kbps quality — barely intelligible. We need more bits, but simply making the codebook huge (e.g., 2²⁰ entries) makes lookup intractable.
RVQ solves this brilliantly: use multiple small codebooks in sequence. The first codebook captures the coarse shape of the audio. The second codebook quantizes the residual — the error left after the first. The third quantizes the error of the error. And so on:
Level 1: q₁ = quantize(z) → captures coarse shape
Level 2: q₂ = quantize(z − q₁) → captures mid detail
Level 3: q₃ = quantize(z − q₁ − q₂) → captures fine detail
...
Level K: qₖ = quantize(z − Σᵢ₌₁ᵏ⁻¹ qᵢ) → captures residual detail
Reconstructed: ẑ = q₁ + q₂ + q₃ + … + qₖ
With 8 levels of 1,024-entry codebooks, we get 8 × 10 = 80 bits per frame. At 12.5 frames/second, that's 1,000 bits/second = 1 kbps. For comparison, MP3 uses 128 kbps. Neural codecs achieve comparable quality at 1/128th the bitrate.
```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, num_levels=8, codebook_size=1024, dim=128):
        super().__init__()
        self.levels = nn.ModuleList([
            VectorQuantizer(codebook_size, dim)
            for _ in range(num_levels)
        ])

    def forward(self, z):
        all_indices = []
        total_loss = 0.0
        residual = z
        quantized_sum = torch.zeros_like(z)
        for level in self.levels:
            q, indices, loss = level(residual)
            quantized_sum = quantized_sum + q
            residual = z - quantized_sum  # what's left to explain
            all_indices.append(indices)
            total_loss = total_loss + loss
        return quantized_sum, all_indices, total_loss  # 8 token streams!
```
This creates a natural hierarchy. The first RVQ level captures semantic content — what is being said, the rough melody of speech. Later levels capture acoustic detail — the speaker's timbre, room reverb, microphone characteristics. This split is crucial: it means a language model can predict the "meaning" of speech (level 1) separately from its "texture" (levels 2–8).
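The defining property of the cascade — each level shrinks the error left by the previous ones — shows up even in a toy NumPy version with random codebooks (purely illustrative; trained codebooks do far better):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z, codebook):
    # Snap each vector to its nearest codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

dim, n_vecs, codebook_size, n_levels = 8, 64, 256, 4
z = rng.normal(size=(n_vecs, dim))

residual = z
reconstruction = np.zeros_like(z)
errors = []
for _ in range(n_levels):
    # each level's random codebook is scaled to the residual it must explain
    codebook = rng.normal(size=(codebook_size, dim)) * residual.std()
    reconstruction += quantize(residual, codebook)
    residual = z - reconstruction
    errors.append(np.abs(residual).mean())
# errors shrinks level by level: each codebook explains part of
# what the previous levels left behind
```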
The codec zoo: EnCodec, SoundStream, Mimi
Three major codecs dominate the TTS landscape:
| Codec | From | Frame rate | RVQ levels | Architecture | Key trick |
|---|---|---|---|---|---|
| EnCodec | Meta, 2022 | 75 Hz | 8 | CNN + LSTM | Multi-scale discriminator |
| SoundStream | Google, 2021 | 50 Hz | 12 | CNN | Residual VQ (original) |
| Mimi | Kyutai, 2024 | 12.5 Hz | 8 | CNN + Transformer | Semantic distillation |
Mimi is the most interesting. At 12.5 Hz (one frame per 80ms), it produces far fewer tokens than EnCodec's 75 Hz. That's 6× fewer tokens for the language model to predict. Mimi achieves this extreme compression by adding a Transformer layer inside the codec and by distilling semantic knowledge from WavLM (a speech understanding model) into the first RVQ level:
```python
# Mimi's semantic distillation loss (sketch)
# Forces the first RVQ level to encode "what is being said"
# rather than acoustic details
wavlm_features = wavlm_encoder(audio)         # semantic targets
codec_features = codec.encoder(audio)         # learned features
first_level = rvq_level_1(codec_features)     # first RVQ level only
semantic_loss = 1.0 - cosine_similarity(
    first_level, wavlm_features
)  # push first level to encode meaning, not acoustics
```
The semantic distillation is a beautiful design choice. By forcing the first RVQ level to encode "meaning" (phonemes, prosody) and letting later levels handle "texture" (timbre, reverb), the language model only needs to nail the semantic tokens. A simpler decoder model can handle the acoustic refinement. This is exactly the split that CSM exploits with its dual-transformer design.
Going continuous: Pocket TTS's radical choice
Pocket TTS breaks from all of the above. Instead of discrete tokens, it predicts continuous latent vectors directly. The codec is modified to output Gaussian-distributed latents (like a VAE) instead of quantized codes:
Pocket TTS: audio → encoder → z ~ 𝒩(μ, σ²) → LM → continuous ẑ → decoder → audio
No codebook. No quantization. No information loss at the bottleneck.
Why? Because quantization is a bottleneck. Each codebook entry is a lossy approximation. When you shrink the model (fewer parameters), the language model struggles to predict tokens precisely — a wrong token in level 1 cascades into garbage audio. Continuous latents avoid this entirely. The model just needs to get "close enough" to the right vector, and the decoder smoothly interpolates.
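A tiny NumPy experiment (illustrative, not from Pocket TTS) makes the contrast concrete. Sampling one wrong discrete token jumps to an unrelated codebook vector; a continuous prediction that is merely "close enough" degrades gracefully:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(1024, 128))  # 1,024 entries, 128 dims
z_true = codebook[0]                     # the "right answer" latent

# Discrete failure mode: the LM samples one wrong token —
# reconstruction lands on an unrelated codebook vector
wrong_token_err = np.abs(codebook[1] - z_true).mean()

# Continuous failure mode: the LM is off by a small amount —
# the error stays small
close_pred = z_true + 0.05 * rng.normal(size=128)
continuous_err = np.abs(close_pred - z_true).mean()
# wrong_token_err is large; continuous_err is tiny
```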
The trade-off: you can't use standard cross-entropy loss (which requires discrete targets). Instead, Pocket TTS uses a flow matching loss with Lagrangian Self-Distillation (LSD), which we'll cover in Section 5.
2. The Language Model
Once we have a compact audio representation — whether discrete tokens or continuous latents — we need a model that can generate them from text. This is where the "language" in "language model" becomes literal: we're treating speech generation as a next-token prediction problem, just like GPT predicts the next word.
Autoregressive generation
The core loop is identical to text generation. Given a sequence of past audio tokens, predict the next one:
```python
# Simplified autoregressive TTS generation
def generate_speech(text, model, codec, max_frames=500):
    text_tokens = tokenize(text)  # "Hello" → [4, 829, 1305]
    audio_tokens = []
    for step in range(max_frames):
        # Concatenate text + generated audio so far
        input_seq = text_tokens + audio_tokens
        # Predict next audio token
        logits = model(input_seq)    # [vocab_size] scores
        next_token = sample(logits)  # temperature sampling
        audio_tokens.append(next_token)
        if next_token == EOS:
            break
    # Decode tokens back to audio
    waveform = codec.decode(audio_tokens)  # 500 tokens → 40 sec at 12.5 Hz
    return waveform
```
At 12.5 frames/second, a 10-second utterance is only 125 tokens — well within the attention window of any modern transformer. Compare that to 240,000 raw samples or 1,000 mel spectrogram frames. The codec is what made this tractable.
The multi-stream problem
With RVQ, we don't have one token per frame — we have K tokens (one per RVQ level). A frame at 12.5 Hz with 8 RVQ levels means 100 tokens per second. Naively flattening this to one long sequence works but is wasteful — level 8 (subtle acoustic detail) doesn't need the full power of a billion-parameter transformer.
Three strategies have emerged:
Delayed patterns (SoundStorm, CSM)
Offset each RVQ level by one timestep. At each position, the model sees level 1 of the current frame, level 2 of the previous frame, level 3 of the frame before that, etc. This allows all levels to be predicted in parallel while maintaining causal ordering.
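The pattern is easy to see in code. A hedged sketch (token values invented for illustration): shift level k right by k−1 steps, padding with a placeholder, so one transformer position stacks level 1 of frame t, level 2 of frame t−1, and so on:

```python
import numpy as np

PAD = -1  # placeholder for positions before a stream has started

def delay_pattern(codes):
    """codes: [K levels, T frames] → delayed: [K, T + K - 1]."""
    K, T = codes.shape
    delayed = np.full((K, T + K - 1), PAD)
    for k in range(K):
        delayed[k, k : k + T] = codes[k]  # level k shifted right by k
    return delayed

codes = np.arange(12).reshape(3, 4)  # 3 RVQ levels, 4 frames
d = delay_pattern(codes)
# column t now stacks level 1 of frame t, level 2 of frame t-1, ...
```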
Dual transformer (CSM)
A large backbone transformer handles text + first RVQ level (semantics). A smaller depth decoder predicts remaining levels (acoustics). The backbone runs once per frame; the decoder runs K times per frame but is much cheaper:
```python
# CSM's dual-transformer approach (sketch)
# Backbone: Llama-1B — sees text + semantic tokens
backbone_out = backbone(text_tokens + semantic_audio_tokens)

# Depth decoder: small transformer — predicts acoustic tokens
acoustic_tokens = []
for level in range(1, 8):  # RVQ levels 2–8
    token = depth_decoder(
        backbone_out[-1],   # conditioning from backbone
        acoustic_tokens,    # tokens from the levels predicted so far
    )
    acoustic_tokens.append(token)
```
Single-step continuous (Pocket TTS)
Predict the entire continuous vector in one shot — no RVQ levels, no multi-step decoding. An MLP head takes the transformer's output and generates the full latent vector in a single forward pass.
Text conditioning
The model needs to know what to say. Different systems handle this differently:
- Phoneme input (Kokoro): Text is first converted to IPA phonemes — "hello" → [h ə l oʊ]. This removes ambiguity (is "read" past or present tense?) and makes the modeling task easier. The trade-off: you need a handcrafted grapheme-to-phoneme (G2P) pipeline.
- Subword input (CSM, Pocket TTS): Standard SentencePiece or BPE tokenizer, same as text LLMs. The model learns pronunciation implicitly. More flexible (handles names, code, foreign words) but requires more data.
3. Three Architectures
Now we have the building blocks. Let's see how three models assemble them into complete systems — each making fundamentally different design choices.
Kokoro: the efficient specialist
Kokoro (82M parameters) isn't a language model at all — it's a non-autoregressive system derived from StyleTTS 2. It predicts the entire mel spectrogram at once, then converts to audio with a specialized vocoder. No codecs, no tokens, no autoregressive loop.
Voicepacks instead of voice cloning
StyleTTS 2 uses a diffusion model to sample speaker styles at inference time — powerful but slow. Kokoro replaces this with pre-computed voicepacks: 512-dimensional style vectors extracted from reference audio and baked into files. At inference, you just load a voicepack — no iterative diffusion, no reference audio processing:
```python
import torch

# Kokoro voice conditioning — voicepack is a frozen 512-d style vector
voicepack = torch.load("voices/af_heart.pt")  # [512]

# The style vector is injected into every layer via AdaIN
# (Adaptive Instance Normalization — same as in style transfer)
def ada_in(content, style):
    mean, std = content.mean(), content.std()
    style_mean, style_std = style_mlp(style)  # learned projection of the voicepack
    return style_std * (content - mean) / std + style_mean
```
Voicepacks are the key to Kokoro's speed. A diffusion-based style model might take 50 denoising steps per utterance. A voicepack is a single matrix multiply. The trade-off: you can't clone arbitrary voices at inference time — someone has to create the voicepack first. This is why the Pocket TTS blog post calls Kokoro's voice set "fixed."
iSTFTNet: math instead of neural upsampling
Traditional vocoders (HiFi-GAN) use learned transposed convolutions to upsample from mel spectrogram to waveform. Each upsampling layer doubles or quadruples the time resolution, adding parameters and compute.
Kokoro's iSTFTNet takes a shortcut: the network only needs to predict magnitude and phase at a lower resolution. Then the inverse STFT — a purely mathematical operation, no learnable parameters — reconstructs the full waveform:
HiFi-GAN: mel → [upsample] → [upsample] → [upsample] → [upsample] → waveform
(each upsample = transposed conv + activation + residual block)
iSTFTNet: mel → [lightweight net] → magnitude + phase → iSTFT → waveform
(iSTFT is a fixed matrix multiply, not learned)
The result: fewer parameters, less compute, and often better phase coherence because the iSTFT enforces physical consistency that neural upsampling doesn't.
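That "physical consistency" can be checked in a few lines of NumPy (a toy, not the Kokoro implementation). With a periodic Hann window at 50% overlap, analysis windows sum to exactly 1, so plain overlap-add inverts the transform in the signal's interior — no learned parameters anywhere:

```python
import numpy as np

N, HOP = 512, 256  # 50% overlap
# periodic Hann: w[k] + w[k + N/2] == 1, so overlap-add reconstructs exactly
window = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

def stft(x):
    frames = [x[i : i + N] * window for i in range(0, len(x) - N + 1, HOP)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, length):
    y = np.zeros(length)
    for m, frame_spec in enumerate(spec):
        y[m * HOP : m * HOP + N] += np.fft.irfft(frame_spec, n=N)
    return y

x = np.random.default_rng(2).normal(size=4096)
y = istft(stft(x), len(x))
# interior samples match the original to machine precision
```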
CSM: the conversational model
CSM (Conversational Speech Model, 1.66B parameters) from Sesame AI Labs is fundamentally different. It's an autoregressive transformer that generates discrete audio tokens — essentially a text LLM repurposed for speech. Its unique strength is conversational context: it models multi-turn dialogue, adapting tone and emotion based on what was said before.
The dual transformer
CSM's backbone is Llama 3.2 1B — literally the same architecture as Meta's text LLM, trained from scratch on interleaved text and audio tokens. It processes both modalities in a unified sequence:
```python
# CSM input sequence for a 2-turn conversation
input_sequence = [
    # Turn 1: user's speech (text + audio tokens interleaved)
    TEXT_START, *user_text_tokens, TEXT_END,
    AUDIO_START, *user_audio_tokens_level1, AUDIO_END,
    # Turn 2: assistant's response (text given, audio generated)
    TEXT_START, *response_text_tokens, TEXT_END,
    AUDIO_START,  # model generates from here →
]
# Backbone predicts semantic tokens (RVQ level 1)
# Then depth decoder fills in acoustic detail (levels 2–8)
```
The key innovation: the backbone sees entire conversation history. If the previous turn was angry, the model learns to generate a cautious, conciliatory response. If the previous turn was a joke, the response might have a lighter cadence. This is what Sesame calls "voice presence" — the model doesn't just read text aloud; it performs it.
Compute amortization
Training the depth decoder on every frame of every RVQ level would be extremely expensive. CSM's trick: randomly sample 1/16th of frames for depth decoder training. The backbone still sees every frame (it needs full context), but the acoustic decoder only trains on a random subset. This cuts memory by ~6× with negligible quality loss.
Pocket TTS: the continuous rebel
Pocket TTS (100M parameters) from Kyutai is the newest and most radical. It throws away discrete tokens entirely, predicting continuous latent vectors with a single-step flow model. The result: a model 16× smaller than CSM that runs faster than real-time on a laptop CPU.
Why continuous works
Pocket TTS was distilled from Kyutai TTS 1.6B — a full-size model with RVQ tokens. The insight: when you try to make a discrete-token model smaller, the RQ-transformer (which predicts multiple RVQ levels) becomes the bottleneck. It's hard to shrink without losing quality. But if you remove quantization entirely, the model just needs to predict a single vector per frame — one forward pass through a small MLP:
```python
import torch
import torch.nn as nn

# Pocket TTS: one MLP head replaces the entire RQ-transformer
class LSDHead(nn.Module):
    """Single-step flow matching head via Lagrangian Self-Distillation."""
    def __init__(self, hidden_dim=512, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim + 1, 512),  # z + noise + t
            nn.GELU(),
            nn.Linear(512, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),  # predict clean latent
        )

    def forward(self, transformer_output, noisy_latent, timestep):
        # Predict the clean latent in ONE step —
        # no iterative denoising, no diffusion chain
        x = torch.cat([transformer_output, noisy_latent, timestep], dim=-1)
        return self.net(x)  # → clean continuous latent
```
Model size breakdown
| Component | Params | Role |
|---|---|---|
| Causal transformer | 84M | Context modeling, text understanding |
| LSD head (MLP) | 6M | Single-step latent generation |
| VAE decoder | 10M | Latent → waveform |
| VAE encoder | 18M | Voice sample → prefix (inference only) |
| Total (generation) | 100M | Encoder not needed after first use |
For comparison: CSM's backbone alone is 1B parameters. Pocket TTS achieves comparable quality in 1/10th the parameters by (a) removing quantization overhead, (b) using a 6-layer transformer (distilled from 24), and (c) replacing multi-step decoding with a single MLP forward pass.
4. Voice Cloning & Conditioning
Making a model speak is one thing. Making it speak in your voice is another. This is the dimension where the three architectures diverge most sharply.
The voice identity problem
A voice has dozens of characteristics: pitch range, formant frequencies, speaking rate, breathiness, nasality, accent, vocal fry, the way consonants are released. Capturing all of this in a way a model can use requires either (a) a compact embedding that encodes these traits, or (b) a raw audio prefix that the model learns to imitate.
Strategy 1: Fixed voicepacks (Kokoro)
Kokoro uses pre-computed style vectors — 512 floating-point numbers that encode everything about a voice. These are extracted by running reference audio through a style encoder during training, then saved as .pt files. The vector is frozen at inference time: no voice cloning, only voices that have been pre-extracted.
Voicepacks are injected via Adaptive Instance Normalization (AdaIN) at every layer. This is the same technique used in neural style transfer — the content (phoneme sequence) passes through the network, and at each layer, the mean and variance of the activations are shifted to match the target style:
AdaIN(x) = σ_style · (x − μ(x)) / σ(x) + μ_style, where μ_style, σ_style = MLP(voicepack)
Advantages: Fast, deterministic, no reference audio needed at inference. Limitations: Can't clone an unseen voice without re-extracting the style vector (which requires the style encoder and isn't part of the public model).
Strategy 2: Conversational context (CSM)
CSM doesn't have explicit speaker embeddings. Instead, voice identity is carried by the audio token history. When you provide audio context — previous turns of a conversation — the model attends to those tokens and implicitly extracts the speaker's characteristics:
```python
# CSM "voice cloning" via context:
# provide a reference utterance as conversation history
context = [
    Segment(
        speaker=0,
        text="This is my natural speaking voice.",
        audio=reference_audio,  # 5-10 seconds of target voice
    )
]

# Generate in the same voice
output = csm.generate(
    text="Now I'll say something new.",
    speaker=0,        # same speaker ID as context
    context=context,  # model attends to these tokens
)
```
This is in-context learning — the same mechanism that lets GPT-4 follow instructions without fine-tuning. The model has seen millions of conversations during training and learned to match speaking style within a conversation. The quality depends heavily on the reference audio: more context = better voice matching.
Strategy 3: Audio prefix (Pocket TTS)
Pocket TTS uses the most direct approach: encode the reference audio with the codec encoder, and prepend it to the generation sequence. The model simply continues generating audio that sounds like the prefix:
```python
# Pocket TTS voice cloning
reference_latents = codec.encoder(reference_audio)  # [~62 frames for 5 sec]
text_tokens = tokenizer(text)                       # [~20 tokens]

# Input: [voice_latents..., text_tokens...]
# The model generates audio latents that continue the voice
generated_latents = model.generate(
    prefix=reference_latents,  # voice identity
    text=text_tokens,          # what to say
    temperature=0.7,           # Gaussian temperature
    cfg_alpha=1.5,             # classifier-free guidance
)
audio = codec.decode(generated_latents)
```
Because the voice conditioning lives in the same continuous space as the generation target, the model naturally captures everything about the reference: voice color, emotion, accent, cadence, room acoustics, microphone characteristics. Kyutai reports that 5 seconds of reference audio is sufficient for faithful voice reproduction.
Pocket TTS's approach is elegant: the codec encoder (18M params) only runs once per voice sample. After that, the voice embedding can be cached and reused for unlimited generations. The generation path (100M params) never needs to see the raw reference audio again.
Strategy 4: Fine-tuning with LoRA
All three models can be adapted to a specific voice through fine-tuning, but the most practical approach is LoRA (Low-Rank Adaptation). Instead of updating all parameters, LoRA adds small trainable matrices to the attention layers:
LoRA: y = Wx + BAx (B ∈ ℝᵈˣʳ, A ∈ ℝʳˣᵈ, r ≪ d)
With r=32 and d=2048: W has ~4.2M params, BA has 131K params (≈3.1%)
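The parameter math is easy to verify with a plain NumPy mock-up of a single LoRA-augmented linear layer (dimensions from the example above; `scale = lora_alpha / r` follows the usual LoRA convention, and B is initialized to zero so the adapter starts as a no-op):

```python
import numpy as np

d, r = 2048, 32
rng = np.random.default_rng(3)

W = rng.normal(size=(d, d)) * 0.02  # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.02  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (zero-init)
scale = 32 / r                      # lora_alpha / r

def lora_linear(x):
    # Frozen path plus low-rank trainable correction
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d)
y = lora_linear(x)  # identical to W @ x until B is trained

frozen_params = W.size              # 4,194,304
trainable_params = A.size + B.size  # 131,072 — about 3% of W
```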
```python
# LoRA config for CSM fine-tuning (from csm-finetune report)
from peft import LoraConfig

config = LoraConfig(
    r=32,           # rank — controls capacity vs efficiency
    lora_alpha=32,  # scaling factor (usually = r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # FFN
    ],
    lora_dropout=0.0,
)
# Result: 29M trainable params out of 1.66B total (1.75%)
# Peak VRAM: 6.2 GB on A100 (vs ~40 GB for full fine-tuning)
# Adapter size: ~100 MB (vs multi-GB for full weights)
```
LoRA fine-tuning for TTS is remarkably data-efficient. For CSM, the Elise voice dataset contains only 1,195 examples. Training for 60 steps with batch size 8 takes ~15 minutes on an A100. The resulting voice is noticeably different from the base model — the speaker identity shifts toward the target while the model retains its ability to speak naturally.
5. Training: Data, Compute & Tricks
You have the architecture. Now: how do you actually train one of these things? The data requirements, loss functions, and training tricks are where the real craft lies.
How much data?
The data requirements vary by over three orders of magnitude:
| Model | Training data | Hours | Cost estimate |
|---|---|---|---|
| Kokoro v0.19 | LibriVox + synthetic | <100 hours | ~$400–$1,000 (A100s) |
| Kokoro v1.0 | LibriVox + Apache + synthetic | ~200–300 hours | ~$1,000–$2,000 |
| Pocket TTS | AMI, GigaSpeech, Emilia, etc. | 88,000 hours | Not disclosed |
| CSM | Public transcribed English | ~1,000,000 hours | Not disclosed (H100s) |
| LoRA fine-tune | Single speaker dataset | 1–10 hours | ~$5–$50 (A100 spot) |
The pattern: architectural efficiency trades against data requirements. Kokoro uses phonemes and voicepacks to pre-structure the problem, so it needs less data to learn the remaining mapping. CSM uses raw text tokens and learns everything end-to-end — more flexible, but it needs 10,000× more data.
Pocket TTS at 88,000 hours is interesting — it's purely public datasets, deliberately chosen for reproducibility. Kyutai notes they're "excited to see how far the method can be pushed with additional private data." The 88k hours is likely far below the model's capacity.
Dataset composition matters
It's not just about hours — it's about diversity. Pocket TTS uses 8 different datasets specifically to cover different domains:
- AMI — meeting recordings (natural conversation, overlapping speech)
- EARNINGS22 — earnings calls (formal speech, financial terminology)
- GigaSpeech — podcasts and YouTube (casual, varied acoustics)
- SPGISpeech — financial transcripts (clear enunciation)
- TED-LIUM — TED talks (public speaking, emotional range)
- VoxPopuli — EU Parliament (accented English)
- LibriHeavy — audiobooks (expressive reading)
- Emilia — large-scale diverse speech (2024, multi-domain)
A model trained only on audiobooks would sound great reading poetry but collapse on conversational speech. A model trained only on phone calls would lack expressiveness. The mix is critical.
Loss functions for TTS
Different architectures use different losses, but they share a common theme: reconstruct audio well, while also fooling a discriminator.
Codec training losses
```python
# Training a neural audio codec (EnCodec, Mimi, etc.)
L_total = (
    # 1. Reconstruction: spectrogram should match
    L_mel         # L1 loss on mel spectrogram (perceptual)
    + L_stft      # multi-scale STFT loss (time-frequency)
    # 2. Adversarial: fool the discriminator
    + L_adv       # GAN loss (make discriminator think output is real)
    + L_feat      # feature matching loss (match discriminator internals)
    # 3. Quantization: stable codebooks
    + L_commit    # commitment loss (codebook ↔ encoder alignment)
    # 4. Semantic (Mimi only): first level should encode meaning
    + L_semantic  # cosine similarity with WavLM features
)
```
Language model losses
For discrete-token models (CSM), the loss is just cross-entropy — the same as GPT:
L = −Σₜ log p(xₜ | x₍<t₎)
Exactly the same as language model pre-training.
For continuous-latent models (Pocket TTS), the loss is flow matching + LSD:
Lagrangian Self-Distillation (LSD)
Standard flow matching trains a model to predict the velocity field of a diffusion process — moving from noise to clean audio over many steps. The problem: you need 50–100 steps at inference time. LSD makes the model predict the clean output in one step by adding a consistency constraint:
Standard flow matching:
v_θ(x_t, t) predicts direction from noisy x_t toward clean x₀
Inference: iterate 50+ steps from x₁ (noise) to x₀ (audio)
LSD addition:
Also train a "shortcut" that goes directly from x_t to x₀
The shortcut must agree with the iterative path (self-consistency)
Inference: ONE step from noise to audio
The "Lagrangian" part: instead of hard-constraining the consistency (which makes optimization unstable), a Lagrange multiplier is introduced that automatically balances flow matching accuracy against single-step shortcut quality. The result: Pocket TTS generates one latent vector per forward pass — no iterative refinement at all.
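To see why a single step can suffice in principle, here's a toy NumPy version of the straight-line flow underlying this setup — not Pocket TTS's training code. Along the path x_t = t·noise + (1−t)·clean, the true velocity is constant, so one Euler step of size 1 from pure noise lands exactly where 100 small steps do:

```python
import numpy as np

rng = np.random.default_rng(4)
clean = rng.normal(size=128)  # x0: stand-in for a clean audio latent
noise = rng.normal(size=128)  # x1: the Gaussian sample generation starts from

def x_t(t):
    # straight-line path: x1 (noise) at t=1, x0 (clean) at t=0
    return t * noise + (1 - t) * clean

velocity = noise - clean  # d(x_t)/dt — constant along a straight path

# Many small Euler steps from t=1 down to t=0...
x, dt = x_t(1.0), 1.0 / 100
for _ in range(100):
    x = x - velocity * dt

# ...reach the same point as one big step of size 1
one_step = x_t(1.0) - velocity * 1.0
```

The catch, of course, is that a trained model only approximates this velocity field, and real paths aren't perfectly straight — which is exactly the gap the self-consistency constraint closes.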
Classifier-Free Guidance (CFG)
CFG is borrowed from image generation (Stable Diffusion uses it heavily). The idea: during training, randomly drop the conditioning (text/voice) some fraction of the time. At inference, run the model twice — once with conditioning, once without — and amplify the difference:
α = 1.0: no guidance (use conditioned output as-is)
α = 1.5: moderate guidance (Pocket TTS default)
α = 3.0: strong guidance (clearer but less natural)
α = 7.0+: over-guided (robotic, distorted)
```python
# CFG during inference
def generate_with_cfg(model, text, voice, alpha=1.5):
    # Forward pass WITH conditioning
    z_cond = model(text=text, voice=voice)
    # Forward pass WITHOUT conditioning (text and voice masked)
    z_uncond = model(text=None, voice=None)
    # Extrapolate away from unconditioned toward conditioned
    z_guided = z_uncond + alpha * (z_cond - z_uncond)
    return z_guided
```
The cost: 2× inference compute (two forward passes). Pocket TTS solves this with distillation — train a student model that produces CFG-quality output in a single pass (see below).
CFG is surprisingly subtle in the audio domain. For images, guidance pushes toward more "typical" outputs (sharper, more saturated). For speech, it pushes toward clearer pronunciation and more distinctive speaker characteristics — but too much makes the voice sound hyperarticulated and unnatural, like a news anchor having a breakdown.
Pocket TTS's Latent CFG
Standard CFG operates on the model's output — the predicted audio token or waveform. But Pocket TTS predicts continuous latents via a one-step flow model, and interpolating in output space for one-step models doesn't work (you'd just be averaging two waveforms, which is layering sounds on top of each other).
Pocket TTS's innovation: apply CFG in the transformer's latent space — on the hidden states z, not the generated audio x:
z_cfg = z_uncond + α · (z_cond − z_uncond) ← guidance applied to hidden states
x = LSD_head(z_cfg) ← generate audio from guided latent
The guided latent z_cfg might be out-of-distribution for the LSD head (it was trained on normal latents, not extrapolated ones). Surprisingly, it still works — the LSD head is robust enough to handle the shifted input, and the quality improvement is significant.
Head Batch Multiplier
Training with flow matching is bottlenecked by the transformer backbone — it's much larger than the MLP head. Pocket TTS's trick: reuse each backbone output 8 times:
```python
# Head Batch Multiplier (N=8)
z = transformer(input_sequence)  # expensive! run once per batch

total_loss = 0
for i in range(8):  # reuse z eight times
    noise = torch.randn_like(target_latent)  # fresh noise each time
    t = torch.rand(1)                        # fresh timestep
    predicted = lsd_head(z, noise, t)
    total_loss += flow_matching_loss(predicted, target_latent)
loss = total_loss / 8  # amortized over 8 samples
```
Each of the 8 iterations uses different random noise and timesteps, so the LSD head sees diverse training signal from a single backbone pass. This is 8× more efficient and also stabilizes training by averaging over multiple noise samples.
Distillation: making it tiny
Pocket TTS's final trick: distill a 24-layer teacher into a 6-layer student. The teacher runs with CFG (α=1.5), producing high-quality guided latents. The student is trained to match those guided latents without running two forward passes:
Teacher (24 layers, CFG): z_teacher = z_uncond + α · (z_cond − z_uncond)
Student (6 layers):       z_student = student_backbone(input)
L_distill = ‖z_student − z_teacher‖²
Result: the student produces CFG-quality output in a single pass, with
4× fewer layers — and, since the second unconditioned pass is gone,
roughly 8× less backbone compute.
The student also gets a frozen copy of the teacher's MLP head. Only the backbone is re-trained. This means the student backbone must learn to output latents in the same space as the teacher's CFG-guided latents — a different distribution than what the MLP head was trained on, but close enough to work.
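One training step of this setup might look like the following sketch; `teacher_backbone` and `student_backbone` are hypothetical callables (the teacher frozen, the student trainable), not Pocket TTS's actual interfaces:

```python
import torch
import torch.nn.functional as F

def distill_step(student_backbone, teacher_backbone, cond, uncond, alpha=1.5):
    """One distillation step (sketch): the student learns to match the
    teacher's CFG-guided latents from a single conditioned pass."""
    with torch.no_grad():  # teacher is frozen
        z_cond = teacher_backbone(cond)
        z_uncond = teacher_backbone(uncond)
        # CFG-guided target latent (two teacher passes)
        z_teacher = z_uncond + alpha * (z_cond - z_uncond)
    z_student = student_backbone(cond)       # one pass, no CFG
    return F.mse_loss(z_student, z_teacher)  # ‖z_student − z_teacher‖²
```

In a real run the loss would be backpropagated through `student_backbone` only; the frozen MLP head never appears in this loss at all, which is exactly why the student must land in the teacher's guided-latent distribution.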
Gaussian Temperature Sampling
Temperature sampling is standard for discrete language models — divide logits by a temperature T before softmax to control randomness. But Pocket TTS generates continuous vectors, so there are no logits.
The continuous equivalent: scale the noise variance. The LSD head takes Gaussian noise as input and produces a latent vector. Scaling the noise standard deviation to √τ (where τ < 1 is the temperature) produces less diverse but higher-quality outputs:
With temperature τ = 0.7: noise ~ 𝒩(0, τ), i.e. std = √0.7 ≈ 0.837
Lower temperature → noise closer to zero → output closer to the mode
→ cleaner, more predictable speech (at the cost of variety)
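A quick numerical check of the scaling (the function name is illustrative, not from Pocket TTS):

```python
import torch

def temperature_noise(shape, tau=0.7):
    # Continuous temperature sampling: variance tau, i.e. std = sqrt(tau).
    # tau = 1.0 recovers standard sampling; tau -> 0 collapses to the mode.
    return torch.randn(shape) * tau ** 0.5

torch.manual_seed(0)
noise = temperature_noise((200_000,), tau=0.7)
print(f"std ≈ {noise.std().item():.3f}")  # close to sqrt(0.7) ≈ 0.837
```

Feeding this scaled noise to the LSD head in place of standard Gaussian noise is the whole trick — no logits, no softmax.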
6. Putting It All Together
The three philosophies
| | Kokoro | CSM | Pocket TTS |
|---|---|---|---|
| Philosophy | Pre-structure everything | Learn everything end-to-end | Remove the bottleneck |
| Parameters | 82M | 1,660M | 100M |
| Audio repr. | Mel spectrogram | Discrete tokens (Mimi RVQ) | Continuous latents (VAE) |
| Text input | Phonemes (handcrafted G2P) | Subword tokens (learned) | Subword tokens (learned) |
| Generation | Non-autoregressive | Autoregressive (dual transformer) | Autoregressive (single step) |
| Voice cloning | Voicepacks (fixed set) | In-context (conversation history) | Audio prefix (5 sec) |
| Vocoder | iSTFTNet | Mimi decoder | VAE decoder |
| Training data | ~200 hours | ~1M hours | 88K hours |
| CPU real-time | ✓ | ✗ | ✓ |
| WER | 1.93 | — | 1.84 |
The evolution of TTS pipelines
The trend is clear: from explicit pipelines to learned everything, from big models to efficient ones, from discrete tokens to continuous representations. Five years ago, TTS required three separate models (text-to-phones, phones-to-mel, mel-to-wave). Today, a single 100M-parameter network goes from text to waveform.
What to try if you're getting started
- Run Kokoro locally — `pip install kokoro`. It's tiny, fast, and great for understanding the phoneme → audio pipeline. Swap voicepacks to hear different speakers.
- Run Pocket TTS — `uvx pocket-tts serve`. Record yourself for 5 seconds and hear the model clone your voice. The wow factor of real-time CPU voice cloning is hard to beat.
- Fine-tune all three architectures — the tts-training directory has Modal scripts for CSM-1B, F5-TTS, and Orpheus-3B, all training on the same dataset. Run them, then use `compare_modal.py` to hear the same sentences spoken by each model side by side. Nothing teaches architecture differences like hearing them.
- Read the Mimi codec tutorial — Kyutai's codec explainer is excellent for building intuition about how audio gets compressed to tokens.
- Experiment with temperature and CFG — on any model, try pushing temperature to 0.3 (robotic clarity) and 1.5 (chaotic naturalness). Try CFG at 1.0 (no guidance) and 5.0 (aggressive). The sweet spot is where the model sounds most human.
Key lessons
- The codec is the foundation — every modern TTS system depends on compressing audio to a representation the language model can handle. The quality ceiling is set by the codec, not the LM. A perfect language model with a mediocre codec will sound mediocre.
- Discrete vs. continuous is the central debate — discrete tokens let you use standard LM machinery (cross-entropy, sampling, beam search). Continuous latents avoid quantization loss but require new loss functions (flow matching, LSD) and new sampling strategies (Gaussian temperature).
- Pre-structuring trades data for quality at small scale — Kokoro's phonemes and voicepacks compensate for having 5,000× less data than CSM. If you're training with limited data, phonemes and pre-computed speaker embeddings are your friends.
- CFG is the quality knob — it works for discrete and continuous models, for images and audio, for conditioned and unconditioned generation. Understanding CFG deeply is transferable to any generative domain.
- Distillation is the endgame — train a huge model with unlimited compute, then distill into a tiny model for deployment. Pocket TTS went from 24 layers to 6, from GPU-required to CPU-real-time, with minimal quality loss. This is the playbook for efficient AI in general.
- LoRA makes fine-tuning accessible — 1,195 examples, 60 training steps, 15 minutes on one A100, ~$5. That's the barrier to entry for creating a custom TTS voice. The gap between "reading a paper" and "running it" has never been smaller.
- Voice cloning is a spectrum — from Kokoro's frozen voicepacks (deterministic, fast, limited) through CSM's in-context learning (flexible, expensive, context-dependent) to Pocket TTS's audio prefix (direct, lightweight, faithful). The right approach depends on your use case.
Dev notes
This walkthrough covers Kokoro v0.19/v1.0 (82M params, HuggingFace), Sesame CSM (1.66B params, GitHub), Kyutai Pocket TTS (100M params, GitHub), and Orpheus-3B (GitHub). Key papers: Continuous Audio Language Models (Kyutai, 2025), DSM / Delayed Streams Modeling (Kyutai, 2025), F5-TTS (2024), StyleTTS 2 (2023). Companion training scripts: CSM, F5-TTS, and Orpheus fine-tuning on Modal with side-by-side comparison.
A voice speaks, and the air remembers.