Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu

TL;DR

This paper tackles ultra-low-bitrate speech coding by scaling a transformer-based autoencoder and coupling it with a flexible FSQ bottleneck. The Transformer Audio AutoEncoder (TAAE) uses a large, predominantly transformer architecture with a novel FSQ-based bottleneck and post-hoc residual quantization to achieve state-of-the-art quality at 400 and 700 bps, outperforming strong baselines in both objective metrics and human listening tests. The authors introduce a two-stage training regime with a discriminator-based adversarial objective and a perceptual fine-tune, demonstrate robust performance across languages, and show that the approach can be adapted for streaming with competitive latency. The work highlights the potential of large-scale transformer codecs for high-quality, low-bitrate speech in generative and multimodal pipelines, and provides scaling evidence, multilingual generalization, and practical deployment considerations.

Abstract

The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally, such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by applying a transformer architecture with a large parameter count to this problem, together with a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bitrates of $400$ or $700$ bits-per-second. The trained models strongly outperform existing baselines in both objective and subjective tests.
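
To make the link between an FSQ bottleneck and a bitrate concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified FSQ in which each latent dimension is squashed with tanh and rounded to one of a fixed number of evenly spaced levels. The function names, the frame rate of 25 frames per second, and the level configurations are hypothetical values chosen only so that the arithmetic lands on the 400 and 700 bits-per-second figures quoted above; none of them is stated in this text.

```python
import math

def fsq_quantize(z, levels):
    """Simplified FSQ: snap each latent value to one of levels[i]
    evenly spaced points in [-1, 1] (an assumption, not the paper's code)."""
    out = []
    for x, n in zip(z, levels):
        bounded = math.tanh(x)             # squash into (-1, 1)
        half = (n - 1) / 2
        idx = round((bounded + 1) * half)  # integer index in 0..n-1
        out.append(idx / half - 1)         # map the index back to [-1, 1]
    return out

def bits_per_token(levels):
    """Whole bits needed to index one token of the implicit codebook."""
    return math.ceil(sum(math.log2(n) for n in levels))

def bitrate_bps(levels, frames_per_second, tokens_per_frame=1):
    """Bits per second = frames/s * tokens/frame * bits/token."""
    return frames_per_second * tokens_per_frame * bits_per_token(levels)

# Hypothetical configurations (illustrative only):
print(bitrate_bps([6] * 6, frames_per_second=25))                      # 400
print(bitrate_bps([5] * 6, frames_per_second=25, tokens_per_frame=2))  # 700
```

The design point this illustrates is that an FSQ codebook is implicit: its size is the product of the per-dimension level counts, so the bitrate can be adjusted by changing the levels or the number of tokens per frame without learning a new codebook.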

Paper Structure

This paper contains 42 sections, 15 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Architecture of the proposed model. Detail is shown for the encoder block and sub-blocks. The decoder block is configured identically to the encoder block, with the exception of the strided convolution, which is replaced with its transposed equivalent and moved to the end of the $T_m$ blocks.
  • Figure 2: Results of MUSHRA test.
  • Figure 3: Objective metrics for the TAAE and baselines, evaluated on utterances ranging in length from $3$ s to $25$ s, showing how the models generalize across utterance lengths. Where a baseline has multiple bitrate versions evaluated in this work, the higher-bitrate version is shown here.
  • Figure 4: Demographic breakdowns of the perceptual test.