Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo; Eunsang Lee; Jiyoung Lee

TL;DR

The paper addresses the poor generalization of audiovisual deepfake detectors to unseen forgeries with Referee, a reference-aware framework that uses a one-shot reference video to check cross-modal identity consistency and temporal integrity. It introduces an Identity Bottleneck (IDB) that encodes speaker identity into learnable queries, and an identity-matching mechanism that refines the target identity tokens using cues from the reference; both are integrated into an AV-Transformer that makes the final real/fake decision. An auxiliary identity-verification task with a dedicated loss encourages robust identity discrimination, which improves resilience to new manipulation methods. On FakeAVCeleb, FaceForensics++, and KoDF, Referee achieves state-of-the-art results in cross-dataset and cross-lingual settings, demonstrating the value of cross-modal biometric verification for deepfake detection. Because it relies less on low-level artifacts, the approach is more robust to distribution shift, which matters for real-world deployment.

Abstract

Deepfakes produced by advanced generative models pose a rapidly growing threat, yet existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. It leverages speaker-specific cues from one-shot examples alone to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. The experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.
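
To make the idea of matching identity-related queries between reference and target content more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single-head dot-product attention without learned projections, uses fixed random vectors in place of the learnable queries, and summarizes the match as a mean cosine similarity; the function names (`cross_attend`, `identity_match_score`) are hypothetical. In the actual method, the matched tokens are refined and passed to an AV-Transformer rather than reduced to a single score.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Each query attends over the tokens (scaled dot-product, single head).

    Assumption: no learned key/value projections, purely for illustration.
    Returns one "identity token" per query.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(d) for t in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)])
    return out


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12
    return dot(a, b) / denom


def identity_match_score(queries, ref_tokens, tgt_tokens):
    """Mean cosine similarity between reference and target identity tokens.

    The same queries pool both inputs, so matching identities should yield
    similar pooled tokens (a hypothetical stand-in for identity matching).
    """
    ref_id = cross_attend(queries, ref_tokens)
    tgt_id = cross_attend(queries, tgt_tokens)
    sims = [cosine(r, t) for r, t in zip(ref_id, tgt_id)]
    return sum(sims) / len(sims)


# Usage: 4 stand-in queries of width 8, with 10 reference and 10 target tokens.
rng = random.Random(0)
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
reference = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
target = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
score = identity_match_score(queries, reference, target)
```

A score near 1 indicates that the pooled identity tokens of the reference and target agree; a trained system would learn the queries and the decision rule instead of thresholding this value directly.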
