Advancing Continual Learning for Robust Deepfake Audio Classification

Feiyi Dong, Qingchen Tang, Yichen Bai, Zihan Wang

TL;DR

This work tackles the problem of robust deepfake audio detection under unseen spoofing attacks by introducing CADE, a continual learning framework that preserves past knowledge while adapting to new threats. CADE combines a fixed-memory replay strategy with three loss components: Knowledge Distillation, Attention Distillation via Grad-CAM, and an Embedding-based Positive Sample Alignment across multiple layers, formalized as $\text{CADE} = \text{Replay} + L_c + \alpha L_{\text{kd}} + \beta L_{\text{ad}} + \gamma L_{\text{psa}}$. Empirical evaluation on the ASVspoof2019 dataset demonstrates that CADE consistently outperforms traditional continual learning baselines across different spoofing types and backbones (RawNet2, LFCC-LCNN), achieving a lower equal error rate (EER) even with limited memory. The findings suggest that CADE offers a practical, memory-efficient solution for adaptive, long-term audio anti-spoofing systems with real-world applicability.
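Since the results are reported as EER, a minimal sketch of how that metric can be approximated from detector scores may help. EER is the operating point where the false acceptance rate equals the false rejection rate. The function name `equal_error_rate` and the convention that a higher score means "more likely bona fide" are assumptions for illustration, not the paper's evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by scanning every observed score as a threshold.

    Assumed convention: higher score = more likely bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With a finite score set the two rates may never match exactly, so this sketch returns their average at the closest threshold.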

Abstract

The emergence of new spoofing attacks poses an increasing challenge to audio security. Current detection methods often falter when faced with unseen spoofing attacks. Traditional strategies, such as retraining with new data, are not always feasible because of the extensive storage they require. This paper introduces a novel continual learning method, Continual Audio Defense Enhancer (CADE). First, by using a fixed-size memory to store randomly selected samples from previous datasets, our approach conserves resources and adheres to privacy constraints. We also apply two distillation losses in CADE. Through distillation in the classifiers, CADE ensures that the student model closely resembles the teacher model. This resemblance helps the model retain old information while facing unseen data. We further refine our model's performance with a novel embedding similarity loss that extends across multiple depth layers, facilitating superior positive sample alignment. Experiments conducted on the ASVspoof2019 dataset show that our proposed method outperforms the baseline methods.
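The fixed-memory replay idea can be sketched as follows. This is a minimal illustration, assuming the memory is refreshed after each task by pooling its current contents with the finished task's samples and keeping a uniform random subset that fits the budget. The class `FixedReplayMemory` and its methods are hypothetical names, and the exact selection policy used in CADE is not given in this summary.

```python
import random


class FixedReplayMemory:
    """Fixed-capacity store of randomly chosen samples from past tasks."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []  # list of (task_id, sample) pairs
        self.rng = random.Random(seed)

    def add_task(self, task_id, task_samples):
        # Pool the old memory with the finished task, then keep a uniform
        # random subset so the memory never exceeds its capacity.
        # (Assumed policy; other schemes, e.g. per-task quotas, are possible.)
        pool = self.samples + [(task_id, s) for s in task_samples]
        keep = min(self.capacity, len(pool))
        self.samples = self.rng.sample(pool, keep)

    def replay_batch(self, new_samples):
        # Training input for the next task: new data plus stored old data.
        return list(new_samples) + [s for _, s in self.samples]
```

For example, with `capacity=4`, adding two tasks of three samples each leaves exactly four stored samples, and `replay_batch` prepends the new task's data to them.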

Paper Structure (17 sections, 6 equations, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Method diagram of proposed CADE approach. Specifically, this diagram illustrates the training process at time $t$ for our novel method. At this time step, new data from task $t$ and a subset of data from task $t-1$ (selected using a replay strategy) are combined to form the input. This input is concurrently fed into the model from time $t-1$ (teacher model) for inference and the model at time $t$ (student model) for training. $L_c$ refers to the classification loss. The student model's training is further guided by three loss functions: $L_{\text{ad}}$ (Attention Distillation Loss derived from Grad-CAM), $L_{\text{psa}}$ (Positive Sample Alignment Loss), and $L_{\text{kd}}$ (Knowledge Distillation Loss), which collectively help the student model to retain the teacher model's knowledge.
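The way the four loss terms in Figure 1 combine can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the knowledge distillation term is written as a temperature-softened KL divergence between teacher and student outputs (a common form, assumed here), and $L_{\text{ad}}$ and $L_{\text{psa}}$ are passed in as precomputed scalars because their exact definitions are not given in this summary. All function names, the temperature, and the default weights are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    # Numerically stable softmax over temperature-scaled logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    # Classification loss L_c for a single example.
    return -math.log(softmax(student_logits)[label])


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Assumed form of L_kd: KL(teacher || student) on softened outputs.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cade_objective(l_c, l_kd, l_ad, l_psa, alpha=1.0, beta=1.0, gamma=1.0):
    # Weighted sum: L_c + alpha * L_kd + beta * L_ad + gamma * L_psa.
    return l_c + alpha * l_kd + beta * l_ad + gamma * l_psa
```

When the student reproduces the teacher's logits exactly, `kd_loss` is zero, so only the classification, attention, and alignment terms contribute to the objective.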