Music Flamingo: Scaling Music Understanding in Audio Language Models

Sreyan Ghosh; Arushi Goel; Lasha Koroshinadze; Sang-gil Lee; Zhifeng Kong; Joao Felipe Santos; Ramani Duraiswami; Dinesh Manocha; Wei Ping; Mohammad Shoeybi; Bryan Catanzaro

TL;DR

Music Flamingo tackles the problem of scaling music understanding in audio–language models by recasting music captioning and QA as reasoning tasks, and by introducing large-scale, multi-cultural data (MF-Skills) and a reasoning-focused post-training pipeline (MF-Think with GRPO). It improves the Audio Flamingo 3 backbone with multilingual ASR, multi-speaker data, and time-aware encoding, then trains with curated captions and QA targeting harmony, structure, lyrics, and culture. The approach yields state-of-the-art results on 12 benchmarks, including strong performance on music QA, MIR, and lyrics transcription, and demonstrates the value of explicit step-by-step reasoning for music understanding. The work also provides datasets and training recipes to support future research toward human-like musical interpretation in ALMs.

Abstract

We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
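
As a rough illustration of the GRPO step mentioned in the abstract, the sketch below shows only the group-relative advantage computation that gives GRPO its name. Several responses are sampled for the same prompt, each is scored, and every score is normalized against the mean and standard deviation of its own group. The function name, the example rewards, and the choice of population standard deviation are assumptions made for illustration. The abstract does not specify the custom rewards or any implementation details, so this is not the authors' training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar score per sampled response to the SAME prompt.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one music question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are discouraged. No separate value network is needed for this baseline.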
