
MusicLIME: Explainable Multimodal Music Understanding

Theodoros Sotirou, Vassilis Lyberatos, Orfeas Menis Mastromichalakis, Giorgos Stamou

TL;DR

MusicLIME addresses the explainability gap in multimodal music understanding by extending LIME to jointly explain interactions between audio and lyrics. It employs a transformer-based multimodal backbone (RoBERTa-large for text and AST for audio) whose concatenated embeddings are fed to a classifier, and it provides both local multimodal explanations and global aggregations. Key contributions include two curated multimodal datasets (Music4All, abbreviated M4A, and an AudioSet-derived subset), a robust global aggregation approach (Global Average Importance and entropy-based Homogeneity-Weighted Importance), and empirical evidence that multimodal explanations yield insights beyond unimodal analyses. The work demonstrates that multimodal explanations can reveal genre- and emotion-specific feature interactions, enhancing interpretability and informing fair, transparent music understanding systems.
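To make the idea of a joint LIME-style explanation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the input has already been split into a handful of interpretable audio components and lyric components (the choice of components, the function name `explain_multimodal`, the kernel, and the toy `toy_model` are all assumptions made for illustration). Both modalities are perturbed together through one binary mask, and a single weighted linear surrogate is fitted, so audio and lyric features are scored on the same scale.

```python
import numpy as np

def explain_multimodal(predict_fn, n_audio, n_text, n_samples=500,
                       kernel_width=0.75, ridge=1e-3, seed=0):
    """LIME-style sketch: fit one weighted linear surrogate over a joint
    binary mask of audio and lyric components (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    d = n_audio + n_text
    # Each row switches components on (1) or off (0) across both modalities.
    masks = rng.integers(0, 2, size=(n_samples, d))
    masks[0] = 1  # keep the unperturbed instance in the sample
    preds = np.array(
        [predict_fn(m[:n_audio], m[n_audio:]) for m in masks], dtype=float
    )
    # Proximity weight: samples that remove fewer components count more.
    distances = 1.0 - masks.sum(axis=1) / d
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Weighted ridge regression with an intercept column.
    X = np.hstack([masks, np.ones((n_samples, 1))])
    A = X.T @ (weights[:, None] * X) + ridge * np.eye(d + 1)
    b = X.T @ (weights * preds)
    coef = np.linalg.solve(A, b)
    return coef[:n_audio], coef[n_audio:d]

def toy_model(audio_mask, text_mask):
    # Stand-in for a multimodal classifier's class probability (assumption):
    # audio component 1 and lyric component 0 drive the prediction.
    return 0.1 + 0.6 * audio_mask[1] + 0.3 * text_mask[0]

audio_w, text_w = explain_multimodal(toy_model, n_audio=3, n_text=2)
```

In this toy setting the surrogate should assign the largest weight to audio component 1 and a smaller weight to lyric component 0, which is the kind of cross-modal ranking a unimodal explainer cannot produce on its own.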

Abstract

Multimodal models are critical for music understanding tasks, as they capture the complex interplay between audio and lyrics. However, as these models become more prevalent, the need for explainability grows: understanding how these systems make decisions is vital for ensuring fairness, reducing bias, and fostering trust. In this paper, we introduce MusicLIME, a model-agnostic feature importance explanation method designed for multimodal music models. Unlike traditional unimodal methods, which analyze each modality separately without considering the interaction between them, often leading to incomplete or misleading explanations, MusicLIME reveals how audio and lyrical features interact and contribute to predictions, providing a holistic view of the model's decision-making. Additionally, we enhance local explanations by aggregating them into global explanations, giving users a broader perspective of model behavior. Through this work, we contribute to improving the interpretability of multimodal music models, empowering users to make informed choices, and fostering more equitable, fair, and transparent music understanding systems.
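The aggregation of local explanations into a global view can be illustrated with a short sketch. This page does not give the exact definitions used in the paper, so the code below assumes the simplest form of a global average: for each feature, the mean absolute local weight over the instances in which that feature appears. The function name `global_average_importance` and the feature labels are hypothetical, and the entropy-based Homogeneity-Weighted Importance is not sketched here.

```python
from collections import defaultdict

def global_average_importance(local_explanations):
    """Average the absolute local weight of each feature over the
    instances where it occurs (assumed definition, for illustration)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for explanation in local_explanations:
        for feature, weight in explanation.items():
            totals[feature] += abs(weight)
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}

# Two hypothetical local explanations mixing lyric and audio features.
local_explanations = [
    {"lyrics:love": 0.4, "audio:drums": -0.2},
    {"lyrics:love": 0.2},
]
global_scores = global_average_importance(local_explanations)
```

Averaging over occurrences rather than over all instances keeps rare but decisive features from being diluted. Whether that matches the paper's normalization is an assumption.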

Paper Structure (10 sections, 2 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of MusicLIME.
  • Figure 2: Top 10 features from the global aggregates for the hip hop, punk, and pop genres from the Music4All dataset.
  • Figure 3: Top 10 lyrical features for the heavy music, hip hop, and pop genres for both datasets clustered.