Music Flamingo: Scaling Music Understanding in Audio Language Models

Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro

TL;DR

Music Flamingo tackles the problem of scaling music understanding in audio–language models by recasting music captioning and QA as reasoning tasks, and by introducing large-scale, multi-cultural data (MF-Skills) and a reasoning-focused post-training pipeline (MF-Think with GRPO). It improves the Audio Flamingo 3 backbone with multilingual ASR, multi-speaker data, and time-aware encoding, then trains with curated captions and QA targeting harmony, structure, lyrics, and culture. The approach yields state-of-the-art results on 12 benchmarks, including strong performance on music QA, MIR, and lyrics transcription, and demonstrates the value of explicit step-by-step reasoning for music understanding. The work also provides datasets and training recipes to support future research toward human-like musical interpretation in ALMs.

Abstract

We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
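The abstract names GRPO-based reinforcement learning with custom rewards but does not spell out the rewards or the update. As a minimal sketch of the group-relative step that GRPO is generally built on, the snippet below normalizes the rewards of several sampled responses to one prompt against their own group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per sampled response to the same prompt.
    Responses above the group mean get a positive advantage, those below
    it a negative one. `eps` (an assumed constant) avoids division by zero
    when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one music question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, no separate value network is needed to provide a baseline; the policy is pushed toward responses that score better than their siblings for the same prompt.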

Paper Structure

This paper contains 17 sections, 1 equation, 26 figures, 5 tables.

Figures (26)

  • Figure 1: Comparison of captions for two diverse, full-length, in-the-wild songs by Music Flamingo and other frontier models. Prior models, such as AF3, tend to output short, surface-level descriptions (e.g., broad genre, tempo, or instrumentation), while Qwen3-Omni offers isolated observations without forming a coherent musical narrative. In contrast, Music Flamingo produces detailed, multi-layered captions that integrate theory-aware analysis with performance context. It links surface attributes (tempo, key, etc.) to mid-level structures (chord progressions, vocal phrasing, etc.) and higher-level dimensions (lyrical meaning, emotional trajectory, etc.). This ability to connect one aspect of music to another results in richer, more holistic captions that resemble how trained musicians describe songs. Detailed expert analysis is provided in the appendix.
  • Figure 2: I. Annotation pipeline for constructing our proposed datasets from diverse music clips. II. Training pipeline of Music Flamingo: we begin by improving Audio Flamingo 3, then perform full fine-tuning on music datasets to build the Music Flamingo foundation model. Finally, the model undergoes reasoning cold-start training followed by GRPO fine-tuning to enable step-by-step reasoning.
  • Figure 3: Examples from MF-Skills Caption, MF-Skills QA, and MF-Think. We emphasize that our re-imagined captions are denser, more informative, and designed to require deliberate reasoning to generate. Additional examples are provided in the appendix.
  • Figure 4: Genres (inner circle) & Cultures (outer circle) distribution of songs.
  • Figure 5: Caption generated by Music Flamingo on a modern Spanish song.
  • ...and 21 more figures