Robust Neural Audio Fingerprinting using Music Foundation Models

Shubhr Singh, Kiran Bhat, Xavier Riley, Benjamin Resnick, John Thickstun, Walter De Brouwer

TL;DR

This work tackles robust neural audio fingerprinting under the distortions typical of modern platforms. It uses pretrained music foundation models (MuQ, MERT, BEATs) as backbones and trains them with extensive data augmentation in a contrastive learning framework. The authors introduce a two-layer projection head and evaluate both track-level and segment-level retrieval, using FAISS for efficient search and a time-alignment estimator to localize matches; the dataset setup includes distribution-shift testing. Results show that pretrained backbones consistently outperform models trained from scratch (NAFP, GraFPrint) and a Dejavu baseline in both track-level accuracy and segment localization, though robustness to spectral filtering remains a weakness. The findings suggest that music foundation models enable robust, scalable fingerprinting suitable for catalog management and broadcast monitoring, and they point to targeted augmentation and adversarial testing as ways to close the remaining gaps.

Abstract

The proliferation of distorted, compressed, and manipulated music on modern media platforms like TikTok motivates the development of more robust audio fingerprinting techniques to identify the sources of musical recordings. In this paper, we develop and evaluate new neural audio fingerprinting techniques with the aim of improving their robustness. We make two contributions to neural fingerprinting methodology: (1) we use a pretrained music foundation model as the backbone of the neural architecture and (2) we expand the use of data augmentation to train fingerprinting models under a wide variety of audio manipulations, including time stretching, pitch modulation, compression, and filtering. We systematically evaluate our methods in comparison to two state-of-the-art neural fingerprinting models: NAFP and GraFPrint. Results show that fingerprints extracted with music foundation models (e.g., MuQ, MERT) consistently outperform models trained from scratch or pretrained on non-musical audio. Segment-level evaluation further reveals their capability to accurately localize fingerprint matches, an important practical feature for catalog management.

Paper Structure

This paper contains 8 sections, 1 equation, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The contrastive learning framework for neural audio fingerprinting. Original and augmented audio (e.g., audio with noise, reverb, time/pitch changes) are passed through a shared encoder, followed by a projection head. The resulting embeddings ($z$ and $z'$) are optimized using a contrastive loss to encourage invariance to audio degradations.
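The objective described in the Figure 1 caption can be illustrated with a minimal sketch. The caption does not state the exact loss, so the code below assumes a one-directional InfoNCE-style loss with cosine similarity; the function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper. Each original embedding $z_i$ is pulled toward its augmented counterpart $z'_i$ and pushed away from the other augmented embeddings in the batch.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(z, z_aug, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's contrastive loss).

    z[i] is the embedding of an original clip and z_aug[i] the embedding of its
    augmented version. For anchor z[i], z_aug[i] is the positive and every
    z_aug[j] with j != i is a negative. The temperature is a hypothetical value.
    """
    z = [l2_normalize(v) for v in z]
    z_aug = [l2_normalize(v) for v in z_aug]
    total = 0.0
    for i, anchor in enumerate(z):
        logits = [dot(anchor, cand) / temperature for cand in z_aug]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching augmented clip as the target class.
        total += log_denominator - logits[i]
    return total / len(z)


# Toy embeddings: two clips and their slightly perturbed augmentations.
originals = [[1.0, 0.0], [0.0, 1.0]]
augmented = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = contrastive_loss(originals, augmented)
mismatched_loss = contrastive_loss(originals, list(reversed(augmented)))
print(aligned_loss, mismatched_loss)
```

When the augmented embeddings stay close to their originals, the loss is near zero; pairing each original with the wrong augmentation yields a much larger loss, which is the signal that encourages invariance to audio degradations.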