FakeMusicCaps: a Dataset for Detection and Attribution of Synthetic Music Generated via Text-to-Music Models

Luca Comanducci, Paolo Bestagini, Stefano Tubaro

TL;DR

The paper tackles the detection and attribution of synthetic music generated by Text-To-Music (TTM) models. It introduces FakeMusicCaps, a dataset created by generating 10-second tracks from MusicCaps captions using five TTM architectures to study real/fake discrimination and model attribution under closed- and open-set conditions. Through simple baselines (M5, RawNet2, ResNet18+Spec), the work demonstrates that closed-set attribution is feasible with strong features, while open-set attribution remains challenging and is sensitive to window size and classifier choice. The dataset and baseline results provide a practical benchmark to drive forensic research as TTM models continue to evolve and proliferate.

Abstract

Text-To-Music (TTM) models have recently revolutionized the automatic music generation research field, both by outperforming all previous state-of-the-art models and by lowering the technical proficiency needed to use them. For these reasons, they have quickly been adopted for commercial uses and music production practices. This widespread diffusion of TTMs raises several concerns regarding copyright violation and rightful attribution, and calls for serious consideration by the audio forensics community. In this paper, we tackle the problem of detection and attribution of TTM-generated data. We propose a dataset, FakeMusicCaps, that contains several versions of the music-caption pairs dataset MusicCaps, re-generated via several state-of-the-art TTM techniques. We evaluate the proposed dataset by performing initial experiments regarding the detection and attribution of TTM-generated audio.

Paper Structure (16 sections, 3 figures, 3 tables)

This paper contains 16 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Schematic representation of the text-to-music attribution problem.
  • Figure 2: Confusion matrices of M5 (top), RawNet2 (middle) and ResNet+Spec (bottom) in the three classification scenarios.
  • Figure 3: Balanced accuracy as a function of the considered window size, using M5, RawNet2 and ResNet+Spec.