Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li

TL;DR

The study tackles efficient speech generation with discrete acoustic codes arranged as a two-dimensional grid $[T,N]$, where intra-timestep dependencies challenge parallel decoding. It introduces two Local Transformer heads for iterative multi-codebook prediction—an autoregressive LT and a MaskGIT-based LT—and couples them with frame stacking to boost throughput. Experiments on LibriTTS show that LT-based iterative decoding improves distributional fidelity (lower FD) and preserves intelligibility and speaker similarity while delivering substantial speedups (e.g., AR LT at ~2x and MaskGIT at ~3x with stacking). The work also provides practical guidelines for choosing decoding strategies based on deployment priorities, balancing synthesis fidelity and computational efficiency.

Abstract

Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.
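
The three decoding strategies contrasted in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: `ToyHead` is a hypothetical stand-in for a prediction head, the sizes (4 codebooks, stacking factor 2, vocabulary 16) are arbitrary, and the cosine unmasking schedule is borrowed from the original MaskGIT image work as an assumption. The point is only to show how frame stacking turns a $[T,N]$ grid into $[T/s, s \cdot N]$ and how many head calls each strategy needs per stacked step.

```python
import numpy as np

MASK = -1  # placeholder for a code slot that has not been decoded yet


def stack_frames(codes, s):
    """Reshape a [T, N] code grid into [T // s, s * N]; T must be divisible by s."""
    T, N = codes.shape
    assert T % s == 0, "pad T to a multiple of the stacking factor first"
    return codes.reshape(T // s, s * N)


class ToyHead:
    """Hypothetical prediction head: maps partially decoded codes to [K, V] logits."""

    def __init__(self, K, V, seed=0):
        self.K, self.V, self.calls = K, V, 0
        self.w = np.random.default_rng(seed).normal(size=(K, V))

    def __call__(self, partial):
        self.calls += 1
        # Toy conditioning: rotate the logits by the number of decoded slots.
        return np.roll(self.w, int(np.sum(partial != MASK)), axis=1)


def _sample(logits, rng):
    """Sample one token per slot; also return the probability of each sample."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    tok = np.array([rng.choice(len(row), p=row) for row in p])
    return tok, p[np.arange(len(tok)), tok]


def decode_parallel(head, rng):
    """One head call; all slots are sampled independently."""
    tok, _ = _sample(head(np.full(head.K, MASK)), rng)
    return tok


def decode_ar(head, rng):
    """K head calls; slot k is sampled after slots 0..k-1 are fixed."""
    codes = np.full(head.K, MASK)
    for k in range(head.K):
        tok, _ = _sample(head(codes), rng)
        codes[k] = tok[k]
    return codes


def decode_maskgit(head, rng, n_iters=3):
    """At most n_iters head calls; the most confident slots are committed first."""
    codes = np.full(head.K, MASK)
    for i in range(1, n_iters + 1):
        masked = np.flatnonzero(codes == MASK)
        if masked.size == 0:
            break
        tok, conf = _sample(head(codes), rng)
        # Cosine schedule (assumed): number of slots left masked after this step.
        n_stay = int(np.floor(head.K * np.cos(0.5 * np.pi * i / n_iters)))
        n_stay = min(n_stay, masked.size - 1)  # always commit at least one slot
        order = masked[np.argsort(-conf[masked])]
        commit = order[: masked.size - n_stay]
        codes[commit] = tok[commit]
    return codes
```

In this sketch the autoregressive variant calls the head once per code slot, so its cost grows with both the number of codebooks and the stacking factor, while the MaskGIT-style variant is capped at `n_iters` calls and the parallel variant always uses one. How that cost trades off against quality is an empirical question the sketch does not answer.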

Paper Structure

This paper contains 15 sections, 2 figures.

Figures (2)

  • Figure 1: Model Architecture
  • Figure 2: Evaluation Results on LibriTTS. 1x, 2x, and 4x in the model names refer to the number of frames stacked. (a) LT models achieve the same or better UTMOSv2 scores than the baseline, except the 4x-stacked MaskGIT LT. (b) WERs are similar for all models, with overlapping CIs. (c)(d) SSIMs: at 1x stacking, LT models have an advantage; at 2x, LT models are similar to the baseline; at 4x, LT models are still usable for seen speakers, but with reduced robustness to unseen speakers. (e) FDs: LT models consistently exhibit a strong advantage over parallel sampling. (f) Significant inference speedup from frame stacking. (g) Metrics table.