From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu

TL;DR

This work tackles the mismatch between text and audio generation in end-to-end audio–language models by proposing Text-to-Talk (TtT), a unified Transformer that uses autoregressive (AR) decoding for text and non-autoregressive (NAR) absorbing discrete diffusion for audio. Building on a partial-order factorization and order marginalization, the authors derive a joint training objective that upper-bounds the negative log-likelihood of the joint text–audio distribution, and three training strategies mitigate train–test discrepancies. Across Audio-QA, ASR, AAC, and speech-to-speech benchmarks, TtT consistently outperforms strong AR and NAR baselines, and multimodal pretraining improves performance further. The approach enables parallel audio generation with controlled latency and shows strong cross-modal alignment, which the authors present as a scalable path toward real-world end-to-end speech–text systems. Models, data, and code are to be released.

Abstract

Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data and code to facilitate future research in this direction.

Paper Structure

This paper contains 59 sections, 17 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) Distinct dependency structures for the text and audio modalities. (b) Because text and audio are tokenized at different rates, the last audio span has variable length.
  • Figure 2: Overview of the proposed framework and its diffusion reverse process. (a) TtT framework. A unified MLLM that interleaves AR text and NAR audio generation. The model alternates between AR text decoding and NAR audio synthesis based on control tokens. (b) Diffusion reverse process. NAR audio generation through iterative denoising.
  • Figure 3: Training loss and attention design. (a) Training pipeline. Starting from a pretrained text LLM, we expand the vocabulary with audio tokens and control symbols. Text spans use AR cross-entropy loss while audio spans use NAR diffusion loss, sharing a single Transformer backbone. (b) Attention pattern. Text spans follow causal attention (left-to-right), while audio spans use bidirectional attention within spans but causal attention across spans, enabling parallel audio generation while preserving cross-modal dependencies.
  • Figure 4: Example of ASR data format.
  • Figure 5: Example of TTS data format.
  • ...and 4 more figures
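
The attention pattern described in the Figure 3(b) caption can be illustrated with a small mask builder. This is a minimal sketch and not the authors' implementation: the function name `build_modality_aware_mask`, the per-token inputs `span_ids` and `is_audio`, and the toy sequence are all hypothetical, and the rule encoded here (causal everywhere, plus full visibility inside a single audio span) is one reading of the caption.

```python
from typing import List


def build_modality_aware_mask(span_ids: List[int], is_audio: List[bool]) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed rule (an interpretation of the caption, not the paper's code):
      * every token sees all earlier positions and itself (causal);
      * tokens inside the same audio span additionally see each other
        in both directions (bidirectional within the span).
    """
    n = len(span_ids)
    if len(is_audio) != n:
        raise ValueError("span_ids and is_audio must have the same length")

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_audio_span = is_audio[i] and is_audio[j] and span_ids[i] == span_ids[j]
            mask[i][j] = causal or same_audio_span
    return mask


# Hypothetical toy sequence: 2 text tokens, 3 audio tokens, 1 text token, 2 audio tokens.
span_ids = [0, 0, 1, 1, 1, 2, 3, 3]
is_audio = [False, False, True, True, True, False, True, True]
mask = build_modality_aware_mask(span_ids, is_audio)

# Render the mask: "x" marks an allowed (query row, key column) pair.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the rendered grid, text rows form the usual lower triangle, while each audio span appears as a filled square block on the diagonal. A boolean mask of this shape is what attention implementations typically consume, although the convention for which value means "masked" differs between libraries and should be checked before use.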