Perceptual Musical Features for Interpretable Audio Tagging

Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

TL;DR

This study concludes that there are use cases in automatic music tagging where the value of interpretability outweighs the accompanying deterioration in performance.

Abstract

In the age of music streaming platforms, the task of automatically tagging music audio has garnered significant attention, driving researchers to devise methods aimed at improving performance metrics on standard datasets. Most recent approaches rely on deep neural networks, which, despite their impressive performance, are opaque, making it difficult to explain their output for a given input. While the issue of interpretability has been emphasized in other fields such as medicine, it has not received attention in music-related tasks. In this study, we explored the relevance of interpretability in the context of automatic music tagging. We constructed a workflow that incorporates three different information extraction techniques: a) leveraging symbolic knowledge, b) utilizing auxiliary deep neural networks, and c) employing signal processing to extract perceptual features from audio files. These features were subsequently used to train an interpretable machine-learning model for tag prediction. We conducted experiments on two datasets, namely the MTG-Jamendo dataset and the GTZAN dataset. Our method surpassed the performance of baseline models in both tasks and, in certain instances, was competitive with the current state of the art. We conclude that there are use cases where the deterioration in performance is outweighed by the value of interpretability.
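
To make the idea of an interpretable tag predictor concrete, the following is a minimal sketch of permutation-based feature importance, one of the interpretation tools named in the figure captions. It is not the authors' pipeline: the feature names (`tempo`, `brightness`, `noise`), the synthetic data, and the small logistic-regression model are all assumptions chosen for illustration, and the paper's actual features and model are not specified in this abstract.

```python
import math
import random

# Hypothetical perceptual feature names (assumptions, not taken from the paper).
FEATURES = ["tempo", "brightness", "noise"]


def make_data(n, rng):
    """Synthetic data: the tag depends on the first two features only."""
    X, y = [], []
    for _ in range(n):
        row = [rng.uniform(-1, 1) for _ in FEATURES]
        y.append(1 if 2.0 * row[0] + 1.0 * row[1] > 0 else 0)
        X.append(row)
    return X, y


def predict_proba(w, b, row):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = b + sum(wi * xi for wi, xi in zip(w, row))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            err = predict_proba(w, b, row) - target
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    hits = sum((predict_proba(w, b, row) >= 0.5) == bool(t) for row, t in zip(X, y))
    return hits / len(y)


def permutation_importance(w, b, X, y, rng, repeats=10):
    """Mean drop in accuracy when one feature column is shuffled."""
    base = accuracy(w, b, X, y)
    scores = {}
    for j, name in enumerate(FEATURES):
        drops = []
        for _ in range(repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(w, b, X_perm, y))
        scores[name] = sum(drops) / repeats
    return scores


rng = random.Random(0)
X_train, y_train = make_data(400, rng)
X_test, y_test = make_data(200, rng)
w, b = fit_logistic(X_train, y_train)
importances = permutation_importance(w, b, X_test, y_test, rng)
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10}: {score:.3f}")
```

Because each input is a named, human-readable feature, a large accuracy drop for one feature can be read directly as "the model relies on this property of the audio", which is the kind of explanation an opaque end-to-end network does not offer by default.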

Paper Structure (9 sections, 5 figures)

Figures (5)

  • Figure 1: Overview of the proposed pipeline
  • Figure 2: Overall and permutation-based feature importance of the model trained on the MTG-Jamendo dataset
  • Figure 3: Ablation studies with three different groups of features (harmonic, signal, and midlevel) on both datasets
  • Figure 4: Interpretations of the prediction for the label Disco in the GTZAN dataset based on SHAP values and the model's feature importance
  • Figure 5: Interpretations of the prediction for the label Trailer in the MTG-Jamendo dataset based on SHAP values and the model's feature importance