Step-Audio-R1 Technical Report

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

TL;DR

Step-Audio-R1 addresses the paradox of inverted reasoning benefits in audio-language models by introducing Modality-Grounded Reasoning Distillation (MGRD), an iterative framework that grounds reasoning in acoustic features rather than textual surrogates. The approach combines a frozen audio encoder, an LLM decoder, supervised and reinforcement learning, and iterative self-distillation to shift reasoning from transcripts to native acoustic analysis. Empirical results show Step-Audio-R1 outperforms Gemini 2.5 Pro and matches Gemini 3 Pro on comprehensive audio benchmarks, demonstrating that reasoning is transferable across modalities when anchored to the correct input. This work enables truly multimodal reasoning systems that deeply analyze audio signals and lays groundwork for future cross-modal AI systems.

Abstract

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio-language models: they consistently perform better with minimal or no reasoning. This raises a fundamental question: can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

Paper Structure

This paper contains 21 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 2: The overview of Step-Audio-R1
  • Figure 3: Modality-Grounded Reasoning Distillation
  • Figure 4: Impact of format rewards on audio reasoning training. (a) Format rewards enable faster and more stable convergence to high reward values. (b) Without format rewards, models exhibit systematic reasoning collapse, reducing generated tokens from 3000 to below 1500.
  • Figure 5: Impact of data selection strategies on audio reasoning training. (a) Training on moderately difficult problems (correct passed) achieves higher and more stable rewards compared to failed problems, which collapse after iteration 50. (b) Moderately difficult problems sustain reasoning generation (2300-2800 tokens), while failed problems show a progressive decline, settling around 1800-2000 tokens by iteration 60.
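
The two training choices named in the captions for Figures 4 and 5 can be illustrated with a minimal sketch. It is not the report's implementation, and every concrete detail is an assumption made for illustration: the <think>...</think> delimiter, the 0.2 format weight, and the reading of "moderately difficult" as problems where some but not all sampled attempts pass. The function names format_reward, combined_reward, and select_moderate are hypothetical.

```python
import re

# ASSUMPTION: reasoning is wrapped in <think>...</think> and followed by an answer.
# The source does not specify the actual delimiter format.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 when a non-empty reasoning span precedes a non-empty answer."""
    match = THINK_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    reasoning = match.group(1).strip()
    answer = match.group(2).strip()
    return 1.0 if reasoning and answer else 0.0


def combined_reward(response: str, is_correct: bool, format_weight: float = 0.2) -> float:
    """Blend answer correctness with the format reward.

    ASSUMPTION: a linear blend with weight 0.2; the weighting is illustrative only.
    """
    accuracy = 1.0 if is_correct else 0.0
    return (1.0 - format_weight) * accuracy + format_weight * format_reward(response)


def select_moderate(pass_counts: dict[str, int], num_samples: int) -> list[str]:
    """Keep problems that passed on some, but not all, of num_samples attempts.

    ASSUMPTION: this is one possible reading of "moderately difficult";
    problems with zero passes correspond to the "failed" set in Figure 5.
    """
    return [pid for pid, passed in pass_counts.items() if 0 < passed < num_samples]


if __name__ == "__main__":
    print(format_reward("<think>The pitch rises at the end.</think> It is a question."))
    print(select_moderate({"q1": 0, "q2": 3, "q3": 8}, num_samples=8))
```

Under these assumptions, a response that skips the reasoning span forfeits the format portion of the reward, which is one way a reward could discourage the reasoning collapse described for Figure 4(b).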