AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies
Rui Wang, Dengpan Ye, Long Tang, Yunming Zhang, Jiacheng Deng
TL;DR
AVT2-DWF is a dual-transformer framework that pairs $n$-frame-wise facial tokenization with MFCC-based audio processing and joins the two streams through a Dynamic Weight Fusion (DWF) module, which adaptively weights visual and audio cues. The face and audio transformers extract modality-specific features, which are then fused via a two-layer cross-modal attention to produce a robust detection signal. On three public benchmarks (DeepfakeTIMIT, FakeAVCeleb, and DFDC), the method achieves state-of-the-art intra-dataset results and strong cross-dataset generalization, with ablations confirming that both $n$-frame-wise tokenization and DWF are critical. These results indicate that preserving temporal continuity and learning adaptive modality weights improve deepfake detection in realistic, multi-modal scenarios.
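To make the idea of adaptive modality weighting concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality has already been reduced to a feature vector of equal length and that a scalar relevance score per modality is available; the function names `softmax` and `dynamic_weight_fusion` and the score inputs are hypothetical. In a trained model the scores would come from learned layers rather than being passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight_fusion(visual_feat, audio_feat, visual_score, audio_score):
    """Fuse two equal-length feature vectors with input-dependent weights.

    visual_score and audio_score are hypothetical scalar relevance scores
    (an assumption for illustration; a real model would predict them).
    Returns the fused vector and the (visual, audio) weights, which sum to 1.
    """
    if len(visual_feat) != len(audio_feat):
        raise ValueError("feature vectors must have the same length")
    w_visual, w_audio = softmax([visual_score, audio_score])
    fused = [w_visual * v + w_audio * a for v, a in zip(visual_feat, audio_feat)]
    return fused, (w_visual, w_audio)


if __name__ == "__main__":
    fused, weights = dynamic_weight_fusion([1.0, 2.0], [3.0, 4.0], 2.0, 0.5)
    print("weights:", weights)
    print("fused:", fused)
```

Because the weights are computed per input rather than fixed, a sample whose audio track carries the stronger forgery cue can lean on audio, and vice versa. This is the general motivation for dynamic over static fusion.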
Abstract
As deepfake methods continue to improve, forged content has shifted from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues and thereby enhance detection capability. AVT2-DWF adopts a dual-stage approach to capture both the spatial characteristics and the temporal dynamics of facial expressions, using a face transformer encoder with an n-frame-wise tokenization strategy and an audio transformer encoder. It then applies multi-modal conversion with dynamic weight fusion to address the challenge of fusing heterogeneous information from the audio and visual modalities. Experiments on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance in both intra- and cross-dataset deepfake detection. Code is available at https://github.com/raining-dev/AVT2-DWF.
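As a rough illustration of frame-wise tokenization, the sketch below assumes that "n-frame-wise" means each of the n face frames becomes a single token, kept in temporal order, rather than being split into patches. That reading is an interpretation and is not spelled out in the abstract. The function name `frame_wise_tokens` and the nested-list frame format are hypothetical, and a real encoder would project each token through a learned embedding.

```python
def frame_wise_tokens(frames):
    """Turn n frames into n flat tokens, preserving temporal order.

    Each frame is a 2-D list of pixel values (rows of pixels). Treating a
    whole frame as one token is an assumption made for illustration.
    """
    tokens = []
    for frame in frames:
        # Flatten the frame row by row into a single token vector.
        token = [pixel for row in frame for pixel in row]
        tokens.append(token)
    if len({len(token) for token in tokens}) > 1:
        raise ValueError("all frames must have the same size")
    return tokens


if __name__ == "__main__":
    clip = [
        [[1, 2], [3, 4]],  # frame 0
        [[5, 6], [7, 8]],  # frame 1
    ]
    print(frame_wise_tokens(clip))
```

With one token per frame, the sequence length seen by the transformer equals the number of frames, so attention operates directly across time steps.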
