Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks
Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller
TL;DR
Detecting machine-generated music (MGM) is critical to preserving artistic integrity as MGM proliferates. The authors benchmark seven model families, including traditional ML, deep neural networks, Transformer-based architectures, state-space models, and multimodal approaches, on FakeMusicCaps and the M6 dataset, and use explainable AI (XAI) analyses to interpret model decisions. They find that in-domain performance can be high (e.g., MobileNet at around $0.968$ accuracy) but that out-of-domain generalization degrades, while multimodal models yield gains (around $0.975$) when lyrics are available. The work provides the first comprehensive MGM detection (MGMD) benchmarks, introduces XAI-based fidelity analyses and ensemble explanations, and outlines directions for building more robust and interpretable MGM detectors.
Abstract
Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions. Consequently, MGM detection (MGMD) is crucial for preserving the integrity of these fields. Despite its significance, the MGMD domain lacks the comprehensive benchmark results necessary to drive meaningful progress. To address this gap, we conduct experiments on existing large-scale datasets using a range of foundational models for audio processing, establishing benchmark results tailored to the MGMD task. Our selection includes traditional machine learning models, deep neural networks, Transformer-based architectures, and State Space Models (SSMs). Recognising the inherently multimodal nature of music, which integrates both melody and lyrics, we also explore fundamental multimodal models in our experiments. Beyond providing basic binary classification outcomes, we delve deeper into model behaviour using multiple explainable Artificial Intelligence (XAI) tools, offering insights into the models' decision-making processes. Our analysis reveals that ResNet18 performs best in both in-domain and out-of-domain tests. By providing a comprehensive comparison of benchmark results and their interpretability, we propose several directions to inspire future research on more robust and effective detection methods for MGM.
