Representation-Based Data Quality Audits for Audio

Alvaro Gonzalez-Jimenez, Fabian Gröger, Linda Wermelinger, Andrin Bürli, Iason Kastanis, Simone Lionetti, Marc Pouly

TL;DR

This work transfers the SelfClean data auditing framework from images to audio, targeting three issues in audio datasets: off-topic content (OT), near duplicates (ND), and label errors (LE). It combines dataset-specific representations with indicator functions to produce file-level rankings for each issue type. The rankings are evaluated with the area under the ROC curve (AUROC) and average precision (AP) on synthetic and natural corruptions across ESC-50, GTZAN, and a private industrial dataset. Large pre-trained audio encoders (e.g., BEATs, M2D) provide robust baselines, while intrinsic training on small corpora is generally less effective. SelfClean's indicators deliver strong cross-issue performance and substantial annotation savings, reducing the fraction of effort (FoE) by up to 97% for near duplicates. The result is a unified, production-ready workflow for audio data-quality auditing that makes human-in-the-loop review more efficient.

Abstract

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image domain to the audio domain. The approach uses self-supervised audio representations to identify common data quality issues and produces ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results show that the framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines, and enables significant annotation savings by efficiently guiding human review.
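
To make the representation-to-rank idea concrete, the following is a minimal sketch for the near-duplicate case only. It assumes that candidate pairs are ranked by cosine distance between embeddings and that the ranking is scored with average precision. The function names, the toy embeddings, and the choice of cosine distance are illustrative assumptions, not the authors' implementation or the SelfClean API.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_near_duplicate_pairs(embeddings):
    """Return (distance, i, j) for every pair, closest (most suspicious) first."""
    pairs = [
        (cosine_distance(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    return sorted(pairs)


def average_precision(ranked_labels):
    """Average precision of a ranked list of booleans (True = real issue)."""
    hits, total = 0, 0.0
    for rank, is_issue in enumerate(ranked_labels, start=1):
        if is_issue:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical embeddings: files 0 and 1 are near duplicates.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
true_duplicates = {(0, 1)}

ranking = rank_near_duplicate_pairs(embeddings)
labels = [(i, j) in true_duplicates for _, i, j in ranking]
ap = average_precision(labels)
```

In this toy case the known duplicate pair lands at the top of the review list, so a reviewer would find it after inspecting a single pair. With real audio, the embeddings would come from an encoder such as BEATs or M2D rather than being written by hand.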

Paper Structure

This paper contains 10 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Ranking efficiency and annotation savings, where curves show $\mathrm{FoE}$ (fraction of effort, lower is better) as a function of recall for SelfClean's ranked lists on ESC-50 with synthetic corruption ($\alpha{=}0.05$) using BEATs embeddings. The steep curves for near duplicates and label errors indicate high review efficiency, enabling an operator to identify the majority of issues by inspecting a small fraction of the dataset.
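
As a rough illustration of how a curve like the one in Figure 1 could be computed, the sketch below takes one plausible reading of $\mathrm{FoE}$: the fraction of the ranked list that must be inspected to reach a given recall. This definition, the function name `effort_at_recall`, and the example ranking are assumptions made for illustration; the paper's exact formula may differ.

```python
def effort_at_recall(ranked_labels, recall):
    """Fraction of the ranked list to inspect to reach the target recall.

    ranked_labels: 0/1 flags in review order (1 = real issue).
    recall: target recall in [0, 1].
    Assumed definition of fraction of effort, not taken from the paper.
    """
    total_issues = sum(ranked_labels)
    if total_issues == 0:
        raise ValueError("ranking contains no issues")
    if recall <= 0:
        return 0.0
    needed = recall * total_issues
    found = 0
    for inspected, is_issue in enumerate(ranked_labels, start=1):
        found += is_issue
        # Small tolerance guards against floating-point rounding.
        if found >= needed - 1e-12:
            return inspected / len(ranked_labels)
    return 1.0


# Hypothetical ranking of 10 files containing 3 real issues.
ranked_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# One (recall, effort) point per target recall level.
curve = [(r, effort_at_recall(ranked_labels, r)) for r in (0.5, 1.0)]
```

Under this reading, a good ranking keeps the curve low: here all three issues are found after inspecting 4 of the 10 files, and half of them after inspecting only 2.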