Enhancing Generalization in Audio Deepfake Detection: A Neural Collapse based Sampling and Training Approach

Mohammed Yousif, Jonat John Mathew, Huzaifa Pallan, Agamjeet Singh Padda, Syed Daniyal Shah, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan

TL;DR

The paper tackles the generalization problem in audio deepfake detection, where models trained on one dataset transfer poorly to others because generation algorithms are so diverse. It introduces a neural collapse-based sampling strategy that uses penultimate-layer embeddings from pre-trained models to select representative real and fake samples, constructing a compact, diverse training database. Experiments on ASVspoof 2019 LA show that models trained with this sampling approach generalize to the In-the-wild dataset with comparable performance while requiring substantially less training data, with a tiny ResNet achieving notable cross-domain results. The work demonstrates data-efficient generalization gains and outlines future refinements for fake-data sampling, offering a scalable path toward robust audio deepfake detection.

Abstract

Generalization in audio deepfake detection presents a significant challenge, with models trained on specific datasets often struggling to detect deepfakes generated under varying conditions and by unknown algorithms. While jointly training a model on diverse datasets can enhance its generalization ability, it comes with high computational costs. To address this, we propose a neural collapse-based sampling approach applied to pre-trained models trained on distinct datasets to create a new training database. Using the ASVspoof 2019 dataset as a proof of concept, we implement pre-trained models with ResNet and ConvNeXt architectures. Our approach demonstrates comparable generalization on unseen data while being computationally efficient, requiring less training data. Evaluation is conducted using the In-the-wild dataset.
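
To make the sampling idea concrete, the following is a minimal sketch and not the authors' algorithm. It relies on the neural collapse observation that, late in training, penultimate-layer embeddings of each class concentrate around their class mean. The selection rule used here (keep the `k` embeddings nearest to each class mean) is an assumption for illustration. The function names `class_means` and `sample_near_means` and the toy 2-D data are likewise hypothetical; in practice the embeddings would come from a pre-trained detector.

```python
import math
import random


def class_means(embeddings, labels):
    """Return {label: mean vector} over the penultimate-layer embeddings."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def sample_near_means(embeddings, labels, k):
    """Per class, return indices of the k embeddings closest to the class mean.

    Assumed selection rule (not taken from the paper): nearest-to-mean
    in Euclidean distance.
    """
    means = class_means(embeddings, labels)
    selected = {}
    for lab, mean in means.items():
        idx = [i for i, l in enumerate(labels) if l == lab]
        idx.sort(key=lambda i: math.dist(embeddings[i], mean))
        selected[lab] = idx[:k]
    return selected


# Toy data standing in for real embeddings: two 2-D Gaussian clusters.
rng = random.Random(0)
embeddings = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
embeddings += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(50)]
labels = ["real"] * 50 + ["fake"] * 50

# Indices of the 5 most "central" samples per class.
subset = sample_near_means(embeddings, labels, k=5)
```

The returned indices could then be used to assemble a reduced training set from several source datasets. How many samples to keep per class, and whether real and fake samples should be selected by the same rule, are open choices in this sketch.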

Paper Structure

This paper contains 9 sections, 1 equation, 2 figures, 2 tables, 2 algorithms.

Figures (2)

  • Figure 1: Visualization of the penultimate embedding for real and fake classes in the ASVspoof 2019 training database.
  • Figure 2: A schematic representation of our proposed methodology. Red data points represent the fake class, green data points represent the real class, and blue data points represent the class means.