TASU: Text-Only Alignment for Speech Understanding
Jing Peng, Yi Yang, Xu Li, Yu Xi, Quanwei Tang, Yangui Fang, Junjie Li, Kai Yu
TL;DR
Speech LLMs traditionally rely on large-scale audio-text paired data for cross-modal alignment, which is data- and compute-intensive and often limits generalization. TASU proposes a text-only alignment framework with two components, Label-Synchronous Decoding (LSD) and CTC Posterior Simulation (CPS), that aligns speech and text through a shared CTC-posterior interface while keeping the LLM frozen. It achieves zero-shot ASR with minimal degradation and, when used as a curriculum pre-training stage, improves cross-domain recognition. It also enables strong zero-shot multitask performance on MMSU, surpassing several prominent Speech LLMs at similar data scales. Overall, TASU offers an efficient, scalable path to generalizable Speech LLMs that leverage abundant text data with reduced audio supervision.
Abstract
Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often generalize poorly to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that uses only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can also serve as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Finally, TASU extends its zero-shot generalization to a wide range of speech understanding tasks and notably outperforms prominent Speech LLMs, including GLM-4-Voice and Step-Audio, on the MMSU benchmark, establishing it as an efficient and scalable alignment paradigm for Speech LLMs.
