Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs

Wenhao You; Xingjian Diao; Chunhui Zhang; Keyi Kong; Weiyi Wu; Zhongyu Ouyang; Chiyu Ma; Tingxuan Wu; Noah Wei; Zong Ke; Ming Cheng; Soroush Vosoughi; Jiang Gui

TL;DR

Music AVQA exposes the limitations of general multimodal models by requiring beat-accurate, instrument-specific reasoning over densely layered audio-visual content. The paper argues for specialized input pipelines, spatial-temporal architectures, and musical priors, supported by empirical analyses of datasets (MUSIC-AVQA, MUSIC-AVQA v2.0, and MUSIC-AVQA-R) and method families. It proposes concrete directions such as incorporating fine-grained musical event cues, mid-level musical structure, latent reasoning trajectories, and supervised chain-of-thought to enhance interpretability and generalization. Collectively, these contributions lay a foundation for robust, domain-aware multimodal understanding in music and suggest broader applicability to other complex, structured modalities.

Abstract

While recent Multimodal Large Language Models exhibit impressive capabilities on general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) underscores this need, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and critical dependence on domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this position paper argues that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study offers insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. This work is intended to inspire broader attention and further research, supported by a continuously updated anonymous GitHub repository of relevant papers: https://github.com/xid32/Survey4MusicAVQA.
