Representation-Based Data Quality Audits for Audio
Alvaro Gonzalez-Jimenez, Fabian Gröger, Linda Wermelinger, Andrin Bürli, Iason Kastanis, Simone Lionetti, Marc Pouly
TL;DR
This work transfers the SelfClean data auditing framework from images to audio, targeting three data quality issues: off-topic samples (OT), near duplicates (ND), and label errors (LE). It combines dataset-specific representations with indicator functions to produce file-level rankings for each issue type, evaluated by area under the ROC curve (AUROC) and average precision (AP) on synthetic and natural corruptions across ESC-50, GTZAN, and a private industrial dataset. Large pre-trained audio encoders (e.g., BEATs, M2D) provide robust baselines, while intrinsic training on small corpora is generally less effective. SelfClean's indicators perform strongly across all three issue types and yield substantial annotation savings, with FoE reductions of up to 97% for near duplicates. The approach provides a production-ready, unified workflow for audio data-quality auditing that makes human-in-the-loop review more efficient.
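To make the two ranking metrics concrete, the following is a minimal NumPy sketch of how AUROC and AP can be computed for a ranked list of files. The function names and the toy scores are illustrative assumptions and are not taken from the authors' code; in practice a library implementation such as scikit-learn's would normally be used.

```python
import numpy as np


def average_precision(scores, is_issue):
    """AP of a ranking: mean precision at the rank of each true issue."""
    # Sort files from most to least suspicious.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(is_issue, dtype=bool)[order]
    if not hits.any():
        raise ValueError("need at least one true issue")
    # Precision after reviewing the top-k files, for every k.
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())


def auroc(scores, is_issue):
    """AUROC as the probability that a true issue outranks a clean file."""
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(is_issue, dtype=bool)
    pos, neg = scores[mask], scores[~mask]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need both true issues and clean files")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return float((greater + 0.5 * ties) / (pos.size * neg.size))


# Toy ranking (hypothetical values): higher score = more suspicious.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_is_issue = [1, 0, 1, 0, 0]
print(average_precision(toy_scores, toy_is_issue))
print(auroc(toy_scores, toy_is_issue))
```

In this toy ranking the true issues sit at ranks 1 and 3, so both metrics come out at 5/6. A ranking that places every true issue at the top scores 1.0 on both.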
Abstract
Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image domain to the audio domain. The approach uses self-supervised audio representations to identify common data quality issues and produces ranked review lists that surface each issue type within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results show that the framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines, and enables significant annotation savings by efficiently guiding human review.
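The representation-to-rank idea can be illustrated with a small sketch that turns per-file embeddings into three review rankings. This is a simplified stand-in and not the SelfClean indicator functions themselves: the nearest-neighbor off-topic score and the same-class versus other-class distance ratio are assumptions chosen for brevity, and the 2-D embeddings and labels are hypothetical toy data standing in for the output of an audio encoder.

```python
import numpy as np


def cosine_distances(emb):
    """Pairwise cosine distance between row-wise embeddings."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def near_duplicate_pairs(dist):
    """All file pairs (i < j), closest pairs first."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j], kind="stable")
    return [(int(a), int(b)) for a, b in zip(i[order], j[order])]


def off_topic_scores(dist):
    """Distance to the nearest other file; isolated files score high."""
    masked = dist + np.diag(np.full(dist.shape[0], np.inf))
    return masked.min(axis=1)


def label_error_scores(dist, labels):
    """Near 1 when a file sits closer to another class than to its own.

    Assumes every class has at least two files and that at least
    two classes are present.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # a file is not its own neighbor
    diff = labels[:, None] != labels[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_diff = np.where(diff, dist, np.inf).min(axis=1)
    return d_same / (d_same + d_diff)


# Hypothetical embeddings: files 0 and 1 are near duplicates, file 5 is
# labeled 0 but sits in the class-1 cluster, file 6 is far from everything.
emb = np.array([[1.0, 0.0], [1.0, 0.01], [0.9, 0.3], [0.0, 1.0],
                [0.3, 0.9], [0.1, 1.0], [-1.0, -0.2]])
labels = [0, 0, 0, 1, 1, 0, 1]

dist = cosine_distances(emb)
top_duplicate_pair = near_duplicate_pairs(dist)[0]
top_off_topic = int(np.argmax(off_topic_scores(dist)))
top_label_error = int(np.argmax(label_error_scores(dist, labels)))
print(top_duplicate_pair, top_off_topic, top_label_error)
```

On this toy data the pair (0, 1) heads the near-duplicate list, file 6 heads the off-topic list, and file 5 heads the label-error list. A reviewer would work down each list from the top and stop once the hits become rare, which is where the annotation savings come from.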
