
Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury

TL;DR

The paper tackles robust detection and fine-grained localization of audio-visual deepfakes, with a focus on partial and localized manipulations. For Task 1 (detection) it introduces an ensemble of specialized audio and visual countermeasures; for Task 2 (localization) it adopts an ActionFormer-inspired framework. On the TestA split, the submission achieved the best performance in temporal localization and a top-four ranking in classification. Key contributions include adapting LipForensics and audio backbones (ResNet, Multi-Resolution gMLP with Wav2Vec SSL) for partial detection, a polynomial fusion strategy, and a multi-model localization pipeline that uses Soft-NMS to combine proposals. The results on the AV-Deepfake1M++ dataset point to the value of cross-modal fusion and localization-aware training for deepfake detection.
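
To make the proposal-combination step concrete, here is a minimal sketch of Soft-NMS applied to one-dimensional temporal segments pooled from several models. This summary does not state which Soft-NMS variant or parameters the authors used, so the Gaussian decay, the `sigma` and `min_score` values, and the names `temporal_iou` and `soft_nms_1d` are all assumptions made for illustration, not the paper's implementation.

```python
import math


def temporal_iou(a, b):
    """Intersection-over-union of two segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_1d(proposals, sigma=0.5, min_score=0.001):
    """Gaussian Soft-NMS over (start, end, score) proposals.

    sigma and min_score are illustrative defaults, not values from the paper.
    Overlapping proposals are down-weighted rather than discarded outright.
    """
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining:
        # Select the proposal with the highest current score.
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][2])
        best = remaining.pop(best_idx)
        kept.append(tuple(best))
        # Decay the scores of the rest according to their overlap with it.
        for p in remaining:
            iou = temporal_iou(best, p)
            p[2] *= math.exp(-(iou ** 2) / sigma)
        # Drop proposals whose score has decayed below the floor.
        remaining = [p for p in remaining if p[2] >= min_score]
    return kept


# Hypothetical proposals (seconds) pooled from two localization models.
pooled = [(1.0, 2.0, 0.9), (1.1, 2.1, 0.8), (5.0, 6.0, 0.7)]
result = soft_nms_1d(pooled)
```

In this toy input, the second proposal overlaps heavily with the first, so it is kept with a much lower score instead of being removed, while the non-overlapping third proposal keeps its original score.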

Abstract

The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in the visual domain, the audio domain, or both, these subtle modifications make detection more challenging. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top-four ranking in the classification task for the TestA split of the evaluation dataset.

Paper Structure

This paper contains 16 sections, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Task 1 overview.
  • Figure 2: Proposed ResNet-based architecture for deepfake classification task.
  • Figure 3: Multi-Resolution gMLP pyramid with Wav2Vec SSL for full file classification.
  • Figure 4: Task 2 overview.
  • Figure 5: ResNet-based end-to-end pipeline for frame-level fake speech detection and localization task.
  • ...and 1 more figure