
Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang, Jiatong Shi, William Chen, Xun Gong, Siddhant Arora, Chin-Jou Li, Masao Someki, Takashi Maekaku, Yusuke Shinohara, Jin Sakuma, Chao-Han Huck Yang, Shinji Watanabe

TL;DR

Bagpiper is among the first works to achieve unified understanding and generation for general audio. Pre-training on a massive corpus of 600B tokens establishes a robust bidirectional mapping between raw audio and the high-level conceptual space expressed by rich captions.

Abstract

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it can synthesize arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.
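The caption-then-process workflow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `RichCaption`, its two fields, `caption_then_process`, and the stub captioner and responder are hypothetical and are not Bagpiper's actual interface. In the paper a single model produces both the caption and the response; the two callables below only stand in for those two stages.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RichCaption:
    """Hypothetical container for a rich caption (field names are illustrative)."""
    transcription: str = ""
    audio_events: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        # Flatten the caption into plain text, the intermediate "conceptual" form.
        events = ", ".join(self.audio_events) if self.audio_events else "none"
        return f"Transcription: {self.transcription}\nAudio events: {events}"


def caption_then_process(
    audio: bytes,
    instruction: str,
    captioner: Callable[[bytes], RichCaption],
    responder: Callable[[str, str], str],
) -> str:
    """Step 1: describe the audio as a rich caption. Step 2: answer from the caption."""
    caption_text = captioner(audio).to_text()
    return responder(instruction, caption_text)


# Stub stages so the sketch runs without any model (assumptions, not real components).
def stub_captioner(audio: bytes) -> RichCaption:
    return RichCaption("hello there", ["door slam", "footsteps"])


def stub_responder(instruction: str, caption_text: str) -> str:
    # Count the events listed on the final "Audio events:" line of the caption.
    events = caption_text.splitlines()[-1].split(": ", 1)[1]
    count = 0 if events == "none" else len(events.split(", "))
    return f"{count} audio events"


answer = caption_then_process(
    b"\x00\x01", "How many sound events occur?", stub_captioner, stub_responder
)
print(answer)
```

The point of the sketch is only the ordering: the response is computed from the caption text rather than directly from the raw audio, which mirrors the intermediate reasoning step the abstract describes.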

Paper Structure

This paper contains 31 sections, 9 figures, and 10 tables.

Figures (9)

  • Figure 1: Bagpiper builds rich caption as a holistic and semantic medium before materializing responses for audio understanding (left) and audio generation (right) tasks.
  • Figure 2: Bagpiper architecture
  • Figure 3: Pre-training data labeling and filtering workflow.
  • Figure 4: SFT data simulation pipeline for open-ended audio task solving. Blue indicates model input; green indicates model output. Dashed boxes are simulated by prompting LLMs. Components marked with * are included in training sequences.
  • Figure 5: Spectrum of audio samples generated by Bagpiper. Our model is capable of generating multi-speaker dialogue with sound and music (upper) and singing voice with accompaniment (lower).
  • ...and 4 more figures