AlexTTS

A text-to-speech system built from scratch using a decoder-only autoregressive transformer

AlexTTS synthesizes speech from a text prompt. Its core is a decoder-only autoregressive transformer implemented with Meta’s Lingua framework.


High-level architecture of AlexTTS. The model takes raw audio and text as input and autoregressively decodes speech tokens, which are then converted back to audio. Note that the reference shown in the figure is not used in the code.

How it works

The pipeline has three main stages:

  • Pre-processing: Raw audio is converted to discrete speech tokens by a speech tokenizer (e.g. DAC), while text is converted to phonemes by a grapheme-to-phoneme (G2P) phonemizer (Misaki).
  • Training: The autoregressive transformer is trained with cross-entropy loss between predicted and ground-truth speech tokens, conditioned on text and speech embeddings.
  • Inference: Conditioned on the text embeddings, the model autoregressively generates speech tokens, which the speech tokenizer then decodes back to audio.
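
The training and inference stages can be illustrated with a minimal sketch in pure Python. This is not the AlexTTS or Lingua code. The names `speech_cross_entropy`, `generate`, and `toy_model` are hypothetical, the "model" is a stand-in function that maps a token prefix to next-token logits, and text and speech tokens are assumed to share one flat vocabulary. Codecs such as DAC emit several codebooks per frame, and how AlexTTS arranges them is not described here, so the sketch assumes a single token stream.

```python
import math
from typing import Callable, List, Sequence

# Assumption: a "model" is any function from a token prefix to next-token logits.
Model = Callable[[Sequence[int]], List[float]]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def speech_cross_entropy(model: Model, text_tokens: Sequence[int],
                         speech_tokens: Sequence[int]) -> float:
    """Mean cross-entropy over speech positions, using teacher forcing.

    At step i the model sees the text tokens plus the ground-truth speech
    tokens before position i, and is scored on speech token i.
    """
    total = 0.0
    for i, target in enumerate(speech_tokens):
        prefix = list(text_tokens) + list(speech_tokens[:i])
        total += -log_softmax(model(prefix))[target]
    return total / len(speech_tokens)


def generate(model: Model, text_tokens: Sequence[int],
             eos_id: int, max_new_tokens: int) -> List[int]:
    """Greedy autoregressive decoding of speech tokens given text tokens."""
    seq = list(text_tokens)
    out: List[int] = []
    for _ in range(max_new_tokens):
        logits = model(seq)
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt == eos_id:  # stop token ends the utterance
            break
        out.append(nxt)
        seq.append(nxt)  # feed the prediction back in as context
    return out


def toy_model(prefix: Sequence[int], vocab_size: int = 6) -> List[float]:
    """Stand-in for the transformer: strongly prefers (last token + 1)."""
    logits = [0.0] * vocab_size
    logits[(prefix[-1] + 1) % vocab_size] = 10.0
    return logits


if __name__ == "__main__":
    text = [0, 1]  # pretend phoneme IDs
    print(generate(toy_model, text, eos_id=5, max_new_tokens=10))  # [2, 3, 4]
    print(speech_cross_entropy(toy_model, text, [2, 3, 4]))
```

In a real system the generated token IDs would be passed to the speech tokenizer's decoder to recover a waveform, and sampling (for example with a temperature) would usually replace the greedy `max` shown here.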