AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies

Rui Wang; Dengpan Ye; Long Tang; Yunming Zhang; Jiacheng Deng

TL;DR

AVT$^2$-DWF introduces a dual-transformer framework with $n$-frame-wise facial tokenization and MFCC-based audio processing, connected by a Dynamic Weight Fusion (DWF) mechanism that adaptively weights visual and audio cues. The face and audio transformers extract modality-specific features, which are then fused via two-layer cross-modal attention to produce a robust detection signal. Across three public benchmarks, the method achieves state-of-the-art intra-dataset results and strong cross-dataset generalization, with ablations confirming the critical roles of both $n$-frame-wise tokenization and DWF. These results indicate that preserving temporal continuity and learning adaptive modality weights enhance deepfake detection in realistic, multi-modal scenarios.

Abstract

With the continuous improvement of deepfake methods, forged content has shifted from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both the spatial characteristics and the temporal dynamics of facial expressions. This is achieved through a face transformer encoder with an n-frame-wise tokenization strategy and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of heterogeneous information fusion between audio and visual modalities. Experiments on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance in both intra- and cross-dataset deepfake detection. Code is available at https://github.com/raining-dev/AVT2-DWF.
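To make the tokenization contrast concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "patch-wise" means cutting every frame into square patches and that "n-frame-wise" means keeping each whole face frame as a single token. The function names, the 32x32 frame size, and the patch size of 16 are hypothetical choices for illustration. The clip length of 30 frames follows the Figure 2 caption.

```python
import numpy as np

def patch_wise_tokens(frames, patch):
    """Conventional scheme: split every frame into patch x patch tiles.

    frames: (T, H, W, C) array; returns (T * gh * gw, patch * patch * C).
    """
    t, h, w, c = frames.shape
    gh, gw = h // patch, w // patch
    x = frames[:, :gh * patch, :gw * patch, :]      # drop any ragged border
    x = x.reshape(t, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)               # (T, gh, gw, patch, patch, C)
    return x.reshape(t * gh * gw, patch * patch * c)

def frame_wise_tokens(frames):
    """Assumed n-frame-wise scheme: one token per whole frame.

    frames: (T, H, W, C) array; returns (T, H * W * C).
    """
    return frames.reshape(frames.shape[0], -1)

# Hypothetical clip: 30 face crops of size 32x32 with 3 channels.
frames = np.zeros((30, 32, 32, 3), dtype=np.float32)
patch_tokens = patch_wise_tokens(frames, patch=16)  # 4 tokens per frame
frame_tokens = frame_wise_tokens(frames)            # 1 token per frame
```

Under these assumptions the frame-wise sequence has exactly one token per time step, so the token order is the temporal order of the clip, whereas the patch-wise sequence interleaves spatial and temporal positions.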

Paper Structure (14 sections, 6 equations, 3 figures, 5 tables)

Introduction
Method
Face Transformer Encoder
Audio Transformer Encoder
Multi-Modal Transformer with Dynamic Weight Fusion
Experiment
Dataset
Implementation
Comparisons with the State of the Art
Cross-dataset Evaluation
Ablation Study
Benefit of DWF module
Benefit of $n$-frame-wise tokenization
Conclusion

Figures (3)

Figure 1: The top image illustrates the conventional approach of packaging video frames with a patch-wise tokenization scheme. The bottom image shows our proposed method, which employs an $n$-frame-wise tokenization strategy.
Figure 2: The AVT$^2$-DWF training process: the audio is converted into MFCC features and fed into the audio transformer encoder; at the same time, each group of 30 visual frames is fed into the face transformer encoder. The two outputs are concatenated and passed to the dynamic weight fusion (DWF) module to obtain audio and visual weights. These weights are multiplied with the outputs of the audio and visual encoders, and the weighted features are concatenated for detection.
Figure 3: DWF architecture. The input comprises the features $\mathbf{F}_\ell$ and $\mathbf{A}_\ell$ extracted by the face and audio transformer encoders. The weights $W_F$ and $W_A$ are first initialized, and the MHCA is then used to learn weight values relevant to each modality. These weight values are propagated to the next DWF layer.
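
The data flow in the Figure 2 caption (concatenate the two encoder outputs, derive one weight per modality, scale each modality's features, concatenate again) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's DWF module: a softmax over a single linear projection stands in for the MHCA-based weight learning, and `weighted_fusion`, `proj`, and the feature sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_fusion(face_feat, audio_feat, proj):
    """Weight and concatenate two modality features.

    face_feat, audio_feat: (batch, dim) encoder outputs.
    proj: (2 * dim, 2) projection; a stand-in for the MHCA-based DWF,
          which is NOT reproduced here.
    Returns the fused features (batch, 2 * dim) and the weights (batch, 2).
    """
    joint = np.concatenate([face_feat, audio_feat], axis=-1)
    weights = softmax(joint @ proj, axis=-1)   # one weight per modality
    w_face, w_audio = weights[:, :1], weights[:, 1:]
    fused = np.concatenate([w_face * face_feat, w_audio * audio_feat], axis=-1)
    return fused, weights

# Hypothetical inputs: batch of 4 samples, 8-dimensional features per modality.
rng = np.random.default_rng(0)
face = rng.normal(size=(4, 8))
audio = rng.normal(size=(4, 8))
proj = rng.normal(size=(16, 2))
fused, weights = weighted_fusion(face, audio, proj)
```

In this sketch the two weights sum to one for every sample, so a sample whose audio looks uninformative shifts weight toward the visual features and vice versa. Whether the paper's weights are normalized this way is not stated in the captions above.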
