Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers

Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang

TL;DR

Siren is a novel LM-based text-to-audio (T2A) framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. It outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results.

Abstract

While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren opens a promising pathway toward unified multi-modal generation frameworks.

Paper Structure

This paper contains 42 sections, 15 equations, 9 figures, 8 tables, 3 algorithms.

Figures (9)

  • Figure 1: (a) The residual vector quantization process, in which a waveform is quantized into $r\times l$ discrete tokens. With $r=1$, it degenerates to naive VQ. (b) Performance curves as $r$ increases.
  • Figure 2: (a) Distributions of the cosine between quantized features from different RVQ layers. "Interval" denotes the layer interval between the two paired features. (b) Distribution of the angles between the gradient vectors that different classifier heads backpropagate to the shared transformer. (c) t-SNE projections of quantized features from different RVQ layers, colored by audio class, with the corresponding accuracy shown on top. (d) Convergence curves for learning different RVQ codes. Due to page limits, the results for all layers are placed in the Appendix. Please zoom in for more details.
  • Figure 3: The pipeline of the proposed Siren. Due to space constraints, we simplify our actual setting ($r=12$) to the case $r=4$, where $r/2=2$ parallel transformers are fed the same input and perform next-time feature prediction. Each predicted feature is then factorized into its two adjacent RVQ codes by independent decoders and classifiers. The predicted tokens are then concatenated and fed into the de-tokenizer to recover the audio waveform. During training, teacher forcing is adopted, so all contextual tokens are ground truth.
  • Figure 4: Overall pipeline for reinforcement learning. Given a prompt, the models trained in the first stage generate the full set of audio tokens; the output of the first transformer is used as the action and is aligned with the conditional preference of the other models through an audio quality reward.
  • Figure 5: Reward (ImageBind cosine) curves for different reinforcement learning strategies. The pair of numbers beside each curve denotes (FD, CLAP score).
  • ...and 4 more figures
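
The residual quantization described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the tokenizer used in the paper: the codebooks here are random rather than learned, and the names `rvq_encode`, `rvq_decode`, and `make_codebooks`, as well as the sizes chosen, are hypothetical. It only shows how $l$ frames become an $r\times l$ grid of tokens, and why $r=1$ reduces to plain VQ.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(frame, codebooks):
    """Quantize one frame into one token per RVQ layer.

    Each layer quantizes the residual left over by the layers before it.
    """
    residual = list(frame)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a frame as the sum of the selected code vectors."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(r, size, dim, seed=0):
    """Random stand-in codebooks (assumption: real ones are learned).

    Deeper layers use a smaller scale, mimicking finer residuals.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.5 ** layer) for _ in range(dim)] for _ in range(size)]
        for layer in range(r)
    ]


r, l, dim = 4, 6, 8  # illustrative sizes only
codebooks = make_codebooks(r, size=16, dim=dim)
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(l)]

# l frames, each with r tokens: r x l tokens in total.
token_grid = [rvq_encode(frame, codebooks) for frame in frames]

# With a single codebook (r = 1) the procedure is plain VQ.
vq_tokens = [rvq_encode(frame, codebooks[:1]) for frame in frames]
```

Because encoding is greedy layer by layer, the first-layer token of each frame is identical whether one or all $r$ codebooks are used; the deeper layers only refine the residual.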