How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li

TL;DR

This work investigates how neural countermeasures (CMs) detect partially spoofed audio by applying Grad-CAM to interpret their decisions and introducing the RCQ metric for quantitative analysis. It shows that CMs trained on partially spoofed data primarily attend to transition-region artifacts created during concatenation, whereas CMs trained on fully spoofed data focus on differences between bona fide and spoofed parts. The authors adapt Grad-CAM to speech, propose a high-resolution Grad-CAM setup with SE-Res1D enhancements, and evaluate on the PartialSpoof dataset alongside ASVspoof data. The findings offer guidance for CM design and dataset construction and establish interpretability as a foundation for partial-spoof detection research, with RCQ enabling dataset-wide analysis of attention patterns.

Abstract

Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, the current understanding of the decision-making process of CMs is limited. We utilize Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs prioritize the artifacts of transition regions created when concatenating bona fide and spoofed audio. This focus differs from that of CMs trained on fully spoofed audio, which concentrate on the pattern differences between bona fide and spoofed parts. Our further investigation explains how CMs' focus varies between correct and incorrect predictions. These insights provide a basis for the design of CM models and the creation of datasets. Moreover, this work lays a foundation for interpretability in partially spoofed audio detection, an area that has not been well explored previously.
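
To make the frame-level Grad-CAM idea concrete, the following is a minimal sketch of the standard Grad-CAM computation applied along a time axis. It is not the authors' implementation: the function name `grad_cam_1d`, the list-based inputs, and the peak normalization are assumptions made for illustration. In practice, the activations and gradients would come from a chosen layer of the CM.

```python
def grad_cam_1d(activations, gradients):
    """Frame-level Grad-CAM scores for one utterance (illustrative sketch).

    activations: K channels, each a list of T feature values from a chosen layer.
    gradients:   same shape, gradients of the target class score w.r.t. activations.
    Returns T scores scaled to [0, 1] (the scaling is an assumption of this sketch).
    """
    if len(activations) != len(gradients):
        raise ValueError("activations and gradients must have the same channel count")

    num_frames = len(activations[0])

    # Channel weight = gradient averaged over time (standard Grad-CAM pooling).
    weights = [sum(g) / len(g) for g in gradients]

    cam = []
    for t in range(num_frames):
        # Weighted sum over channels at frame t, followed by ReLU.
        score = sum(w * a[t] for w, a in zip(weights, activations))
        cam.append(max(score, 0.0))

    # Normalize by the peak so that utterances are comparable.
    peak = max(cam)
    return [c / peak for c in cam] if peak > 0 else cam


if __name__ == "__main__":
    acts = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
    grads = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
    print(grad_cam_1d(acts, grads))
```

Frames with high scores are those whose features push the target class score up the most. Comparing such scores against known segment boundaries is one way to check whether a model attends to transition regions.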

Paper Structure (14 sections, 3 equations, 5 figures, 3 tables)

This paper contains 14 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of partially spoofed speech changing the meaning of a sentence.
  • Figure 2: An example of counterfactual explanations with Grad-CAM [gradcam]. Figures are produced using [jacobgilpytorchcam].
  • Figure 3: Block diagrams illustrating the structures of the (a) SSL-gMLPs and (b) SSL-Res1D models. The dashed yellow line with a camera icon indicates the Grad-CAM.
  • Figure 4: Visualization of the waveform and frame-level Grad-CAM scores for CON_E_0033629.wav from the evaluation set.
  • Figure 5: The prediction score distribution of SSL-Res1D trained on PartialSpoof and tested on the evaluation set, along with a line chart showing the RCQ of the five types of segments across 11 sample groups of partially spoofed samples.