Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo, Eunsang Lee, Jiyoung Lee

TL;DR

The paper addresses the problem of generalizing audiovisual deepfake detection to unseen forgeries. It introduces Referee, a reference-aware framework that uses a one-shot reference video to check cross-modal identity consistency and temporal integrity. An Identity Bottleneck (IDB) encodes speaker identity into learnable queries, and an identity-matching mechanism refines the target identity tokens using reference cues; both feed an AV-Transformer that performs the final detection. Auxiliary identity verification and a dedicated loss promote robust identity discrimination, improving resilience to new manipulation methods. On FakeAVCeleb, FaceForensics++, and KoDF, Referee achieves state-of-the-art results in cross-dataset and cross-lingual settings, demonstrating the value of cross-modal biometric verification for deepfake detection. By reducing reliance on low-level artifacts and increasing robustness to distribution shifts, the approach is better suited to real-world deployment.

Abstract

Deepfakes generated by advanced generative models pose rapidly growing threats, yet existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.

Paper Structure

This paper contains 11 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Referee introduces robust audiovisual deepfake detection guided by a one-shot reference example, verifying cross-modal biometric consistency as well as detecting temporal unnaturalness.
  • Figure 2: The overall framework of Referee. Target and reference videos are encoded into audiovisual features, which are passed through the IDB module to generate identity queries. The identity matching module refines the target tokens with respect to the reference tokens, after which the reference-aware identity queries, together with the target audiovisual features and a [CLS] token, are processed by the AV-Transformer for deepfake classification and identity matching.
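
The data flow described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes the IDB and the identity matching module can each be approximated by a single-head dot-product cross-attention with no learned projections, and that the refinement is a residual update. The function names, dimensions, and random inputs are all hypothetical and chosen only to show how tensors move between the stages.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Each output row is a weighted average of the rows in keys_values.
    """
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


def identity_bottleneck(av_features, id_queries):
    """Hypothetical stand-in for the IDB: queries summarise a feature sequence."""
    return cross_attend(id_queries, av_features)


def identity_matching(target_tokens, reference_tokens):
    """Hypothetical stand-in for identity matching: target tokens attend to
    reference tokens; the residual update is an assumption, not from the paper."""
    attended = cross_attend(target_tokens, reference_tokens)
    return [
        [t + a for t, a in zip(tok, att)]
        for tok, att in zip(target_tokens, attended)
    ]


# Illustrative sizes and random stand-ins for encoder outputs (all assumed).
rng = random.Random(0)
dim, num_queries, num_frames = 8, 4, 16


def random_rows(rows):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


id_queries = random_rows(num_queries)      # learnable in a real model
target_feats = random_rows(num_frames)     # target audiovisual features
reference_feats = random_rows(num_frames)  # one-shot reference features

target_tokens = identity_bottleneck(target_feats, id_queries)
reference_tokens = identity_bottleneck(reference_feats, id_queries)
refined_tokens = identity_matching(target_tokens, reference_tokens)
```

In the framework as captioned, the refined tokens would then be concatenated with the target audiovisual features and a [CLS] token and passed to the AV-Transformer; that stage is omitted here because the caption does not specify its internals.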