TASU: Text-Only Alignment for Speech Understanding

Jing Peng, Yi Yang, Xu Li, Yu Xi, Quanwei Tang, Yangui Fang, Junjie Li, Kai Yu

TL;DR

Speech LLMs traditionally rely on large-scale audio–text data for cross-modal alignment, which is data- and compute-intensive and often limits generalization. TASU proposes a text-only alignment framework using two components, Label-Synchronous Decoding (LSD) and CTC Posterior Simulation (CPS), to align speech and text through a shared CTC-posterior interface while keeping the LLM frozen. It achieves zero-shot ASR with minimal degradation and, as curriculum pre-training, improves cross-domain recognition and enables strong zero-shot multitask performance on MMSU, surpassing several prominent Speech LLMs at similar data scales. Overall, TASU offers an efficient, scalable path to generalizable Speech LLMs that leverage abundant text data with reduced audio supervision.

Abstract

Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalization to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that can leverage only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can further function as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Ultimately, TASU can extend its zero-shot generalization to a wide range of speech understanding tasks and notably outperforms prominent Speech LLMs including GLM-4-Voice and Step-Audio on the MMSU benchmark, establishing TASU as an efficient and scalable alignment paradigm for Speech LLMs.

Paper Structure

This paper contains 11 sections, 4 equations, 1 figure, 4 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of TASU. During training (left), only text is used: transcriptions are tokenized into one-hot vectors and converted into pseudo CTC posteriors via simulation. During inference (right), speech is encoded into real CTC posteriors, which are refined by label-synchronous decoding. In both cases, a trainable projector maps the posteriors into the frozen LLM, which produces transcriptions or outputs for other speech understanding tasks.
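
The caption names the two components but does not say how they are implemented. To make the shared CTC-posterior interface concrete, here is a minimal NumPy sketch under stated assumptions, not the authors' method. The function names, the random repeat and blank insertion scheme, the `peak` smoothing value, and the frame-averaging merge rule are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_ctc_posteriors(token_ids, vocab_size, blank_id=0,
                            max_repeat=3, peak=0.9, seed=0):
    """Hypothetical text-side simulation: tokens -> frame-level pseudo posteriors.

    Assumes token_ids never contains blank_id.
    """
    rng = np.random.default_rng(seed)
    frame_labels = []
    prev = None
    for tok in token_ids:
        # CTC needs a blank between identical neighbours; elsewhere add one at random.
        if tok == prev or rng.random() < 0.5:
            frame_labels.append(blank_id)
        # Stretch each token over a random number of frames (assumed scheme).
        frame_labels.extend([tok] * int(rng.integers(1, max_repeat + 1)))
        prev = tok
    labels = np.asarray(frame_labels, dtype=np.int64)
    # Smoothed one-hot rows: `peak` on the label, the rest spread evenly.
    off = (1.0 - peak) / (vocab_size - 1)
    posteriors = np.full((len(labels), vocab_size), off)
    posteriors[np.arange(len(labels)), labels] = peak
    return posteriors


def label_synchronous_decode(posteriors, blank_id=0):
    """Hypothetical refinement: drop blank frames, average runs of the same label."""
    frame_ids = posteriors.argmax(axis=1)
    merged, current, prev = [], [], blank_id
    for frame, label in zip(posteriors, frame_ids):
        if label != prev and current:
            # The label changed, so the current run is complete.
            merged.append(np.mean(current, axis=0))
            current = []
        if label != blank_id:
            current.append(frame)
        prev = label
    if current:
        merged.append(np.mean(current, axis=0))
    if not merged:
        return np.empty((0, posteriors.shape[1]))
    return np.stack(merged)


# Usage: text-only "training" input, then the same compaction used at inference.
tokens = [5, 5, 2, 7]
pseudo = simulate_ctc_posteriors(tokens, vocab_size=10)
compact = label_synchronous_decode(pseudo)
print(pseudo.shape, compact.shape)
```

In this sketch the compacted matrix has one row per output label, so a projector trained on simulated posteriors would see inputs of the same shape as those produced from real speech at inference time. Whether TASU applies the decoding step to the simulated posteriors during training is not stated in the caption.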