Room Impulse Responses help attackers to evade Deep Fake Detection

Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng

TL;DR

This work analyzes how room impulse responses (RIRs) can both weaken and strengthen fake speech detection on the ASVspoof 2021 benchmark. It shows that adding reverberation to fake speech can double the equal error rate (EER) of state-of-the-art detectors, exposing a real-world vulnerability. The authors propose augmenting the training data with large-scale synthetic and simulated RIRs to train more robust detectors, showing substantial gains (most notably reducing the DF EER from 2.58% to 2.13%) and improved generalization across codecs. Overall, RIR-based data augmentation offers a model-agnostic, scalable defense against evolving audio manipulation, with synthetic RIRs providing the strongest robustness gains.

Abstract

The ASVspoof 2021 benchmark, a widely used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF subset. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness of using room impulse responses (RIRs) to enhance fake speech and increase its likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, we augmented the training data with a large-scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing the DF task EER to 2.13%.
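The core operation behind both the attack and the augmentation is convolving a speech waveform with an RIR. The following is a minimal sketch of that step, not the authors' pipeline: the function names, the 16 kHz sample rate, and the toy exponentially decaying noise RIR are all assumptions made for illustration. In practice the RIR would come from a recorded, synthetic, or simulated dataset.

```python
import numpy as np

def toy_rir(t60_s, sample_rate=16000, seed=0):
    """Hypothetical toy RIR: a direct-path impulse followed by a noise tail
    whose amplitude envelope falls by 60 dB after t60_s seconds."""
    rng = np.random.default_rng(seed)
    n = int(round(t60_s * sample_rate))
    t = np.arange(n) / sample_rate
    # A 60 dB amplitude drop is a factor of 10**-3 at t = t60_s.
    envelope = 10.0 ** (-3.0 * t / t60_s)
    rir = 0.1 * rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir

def reverberate(speech, rir):
    """Convolve speech with an RIR, trim to the input length,
    and rescale so the peak amplitude matches the input."""
    wet = np.convolve(speech, rir, mode="full")[: len(speech)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet

# Usage: reverberate one second of a 220 Hz tone (a stand-in for speech).
sample_rate = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
wet_tone = reverberate(tone, toy_rir(t60_s=0.3, sample_rate=sample_rate))
```

The same function serves both sides: applied to fake speech at test time it models the attack, and applied to training utterances with randomly drawn RIRs it models the augmentation-based defense.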

Paper Structure (15 sections, 1 equation, 4 figures, 1 table)

Figures (4)

  • Figure 1: Scatter plot of T60 and DRR for MIT RIRs.
  • Figure 2: Scatter plot of T60 and DRR for the synthetic RIR test set.
  • Figure 3: FAR (%) of the baseline systems (A, B, and C) at the EER threshold, tested on the reverberant C1R2 evaluation set; results are split by the T60 and DRR values of the augmented RIRs.
  • Figure 4: FAR (%) of the systems trained with recorded RIRs (top row), synthetic RIRs (middle row), and simulated RIRs (bottom row) at the EER threshold, tested on the C1R2 set; results are split by the T60 and DRR values of the augmented RIRs.
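
Figures 3 and 4 report the false acceptance rate (FAR) at the EER threshold. A minimal sketch of how such a number could be computed is given below. It assumes that higher detector scores mean "more likely bona fide"; the function names and the toy scores are illustrative and are not taken from the paper.

```python
import numpy as np

def eer_and_threshold(bonafide_scores, spoof_scores):
    """Return (EER, threshold) by scanning every observed score as a threshold.
    A trial is accepted as bona fide when its score >= threshold."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FRR: bona fide trials rejected; FAR: spoof trials accepted.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])

def far_at_threshold(spoof_scores, threshold):
    """Fraction of spoof trials accepted at a fixed threshold."""
    return float((np.asarray(spoof_scores, dtype=float) >= threshold).mean())

# Toy scores (made up): the threshold is fixed on clean data,
# then reused on spoof trials whose scores have shifted upward.
clean_bonafide = [2.0, 3.0, 4.0]
clean_spoof = [-1.0, 0.0, 1.0]
eer, threshold = eer_and_threshold(clean_bonafide, clean_spoof)
shifted_spoof = [1.0, 2.0, 3.0]
shifted_far = far_at_threshold(shifted_spoof, threshold)
```

Fixing the threshold on one condition and measuring FAR on another shows how many manipulated samples pass a detector that was calibrated without reverberation in mind. Grouping the spoof trials by the T60 or DRR of the applied RIR before calling `far_at_threshold` would give a per-bin breakdown of the kind the captions describe.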