
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Abstract

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs often suffer from catastrophic forgetting of the LLM's original abilities, so balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach aims to preserve the LLM's native language proficiency, thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

Paper Structure

This paper contains 32 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: A general-purpose Large Audio Language Model (LALM) understands multifaceted audio information and accepts arbitrary text prompts to generate corresponding responses, enabling seamless multimodal interaction.
  • Figure 2: (Left) Dataset Construction: An audio description $x^\text{text}$ and a randomly sampled prompt $\mathbf{p}$ are fed into the backbone LLM to generate training targets $\mathbf{y}$. (Right) Model Training: The fusion model is trained using the self-generated targets $\mathbf{y}$ along with the corresponding audio inputs $x^\text{audio}$ and prompt $\mathbf{p}$. The fire and snowflake icons indicate trainable and frozen modules, respectively. The audio decoder is optional.
  • Figure 3: Illustration of the CLAP-style evaluation using the backbone LLM for embedding extraction. (a) Representation Extraction: Inputs are fed into the LLM. We utilize the hidden state of the last token as the aggregated representation for the input. (b) Similarity-based Classification: The derived audio representation is compared against the text representations of all candidate class labels using cosine similarity. The label with the highest similarity score (in this example, "Male") is selected as the prediction.
  • Figure 4: LALM performance on the Dynamic-SUPERB Phase-2 benchmark (Huang et al., 2025), relative to the ASR+LLM baseline. Each cell shows a model’s relative score in a specific domain, with darker blue indicating a higher performance rank. The bottom rows summarize each model’s domain win count (i.e., positive scores) and average performance across all domains.
  • Figure 5: Epoch-wise evolution of cross-modal alignment. The curves depict similarity-based classification accuracy during training.
  • ...and 2 more figures
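
The similarity-based classification described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy three-dimensional hidden states, and the two-label set are all hypothetical, and a real setup would obtain the hidden states from the backbone LLM instead of hard-coding them.

```python
import math
from typing import Dict, List, Sequence


def last_token_representation(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Use the hidden state of the last token as the aggregated representation."""
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token")
    return list(hidden_states[-1])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def classify_by_similarity(
    audio_repr: Sequence[float],
    label_reprs: Dict[str, Sequence[float]],
) -> str:
    """Return the candidate label whose representation is closest to the audio."""
    if not label_reprs:
        raise ValueError("label_reprs must contain at least one candidate label")
    return max(
        label_reprs,
        key=lambda label: cosine_similarity(audio_repr, label_reprs[label]),
    )


# Toy per-token hidden states (hypothetical values standing in for LLM outputs).
audio_hidden_states = [[0.1, 0.0, 0.3], [0.9, 0.1, 0.0]]
label_hidden_states = {
    "Male": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    "Female": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
}

audio_repr = last_token_representation(audio_hidden_states)
label_reprs = {
    label: last_token_representation(states)
    for label, states in label_hidden_states.items()
}
prediction = classify_by_similarity(audio_repr, label_reprs)
print(prediction)
```

With these toy vectors the audio representation is nearly parallel to the "Male" label representation, so that label is selected, mirroring the example in the caption. Because cosine similarity ignores vector magnitude, only the direction of the last-token hidden state affects the prediction.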