UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping

TL;DR

The paper introduces UALM, a single decoder-only language model framework that unifies audio understanding, text-to-audio generation, and multimodal reasoning. It first presents UALM-Gen, a language-model-based text-to-audio generator trained on large-scale data and enhanced by classifier-free guidance and Direct Preference Optimization, and then a UALM pre-training recipe that blends audio tasks with text reasoning to build a shared multimodal representation. Going beyond generation, it introduces UALM-Reason, a post-training regime that uses rich captions, dialogue, and self-reflection to enable multimodal reasoning and iterative refinement of audio outputs. The results show generation quality competitive with diffusion baselines, strong audio understanding, and improved controllability through reasoning, a step toward more controllable and holistic audio intelligence.

Abstract

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Paper Structure

This paper contains 22 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Humans need understanding, generation, and reasoning to handle complex tasks, like composing music.
  • Figure 2: UALM architecture overview and the multimodal pre-training data blending ratios.
  • Figure 3: Rich caption example (left) and post-training workflow (right).
  • Figure 4: Demos: audio generation reasoning and joint understanding-generation reasoning.
  • Figure 5: Statistics of UALM-Gen model. (a) The CLAP scores (CL) with various CFG $\lambda$; (b) The CLAP scores (CL) with various training data volume down-weighting; (c) the DPO loss w/o adaptation on synthetic data before DPO training; (d) the divergence $\pi_{\theta}(y_w|x) - \pi_{\text{ref}}(y_w|x)$ from the reference model w/o CE loss added in DPO training.
  • ...and 5 more figures
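
The Figure 5 caption refers to a CFG scale $\lambda$ and to the DPO quantities $\pi_{\theta}$, $\pi_{\text{ref}}$ and $y_w$ without defining them here. As a rough orientation only, the sketch below shows the commonly used textbook forms of classifier-free guidance on next-token logits and of the DPO loss. It is an assumption that UALM-Gen uses these forms; the function names, the `beta` parameter and the toy numbers are hypothetical and do not come from the paper.

```python
import math

def cfg_logits(cond, uncond, lam):
    """Classifier-free guidance on next-token logits (common form, assumed here).

    cond   : logits computed with the text prompt
    uncond : logits computed with the prompt dropped
    lam    : guidance scale; lam = 1 returns cond, lam = 0 returns uncond
    """
    return [u + lam * (c - u) for c, u in zip(cond, uncond)]

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (assumed form).

    logp_w, logp_l         : policy log-probabilities of the preferred (y_w)
                             and rejected (y_l) samples given the prompt x
    ref_logp_w, ref_logp_l : the same quantities under the frozen reference model
    beta                   : hypothetical temperature on the log-ratio margin
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

# Toy usage with made-up numbers.
guided = cfg_logits([2.0, 0.0], [1.0, 1.0], lam=3.0)
loss_at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)  # policy equals reference
```

In this form, the term `logp_w - ref_logp_w` is the log-probability analogue of the difference plotted in panel (d): it measures how far the policy has moved from the reference model on the preferred sample.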