D2Fusion: Dual-domain Fusion with Feature Superposition for Deepfake Detection
Xueqi Qiu, Xingyu Miao, Fan Wan, Haoran Duan, Tejal Shah, Varun Ojha, Yang Long, Rajiv Ranjan
TL;DR
D2Fusion tackles the generalization gap in Deepfake detection by integrating dual-domain artifact cues through a bi-directional spatial attention module and a fine-grained, DCT-based frequency attention module. A novel feature superposition strategy converts domain features into wave-like tokens and fuses them in a phase-aware manner, amplifying the differences between authentic and forged regions. Extensive experiments across FF++ variants, Celeb-DF, DFDC, and DFR demonstrate strong intra- and cross-dataset performance gains and robust localization of manipulated regions. The approach has practical value for real-world detection under varied forgery techniques. The authors note limitations under high video compression, which warrant future work on frequency transforms and robustness to low-quality video.
Abstract
Deepfake detection is crucial for curbing the harm that Deepfakes cause to society. However, current Deepfake detection methods fail to thoroughly explore artifact information across different domains because the intrinsic interactions between those domains are insufficient. These interactions refer to the fusion and coordination of features after they have been extracted from the different domains, which are crucial for recognizing complex forgery clues. Aiming at more generalized Deepfake detection, we introduce a novel bi-directional attention module that captures the local positional information of artifact clues in the spatial domain. This enables accurate artifact localization and thus addresses the coarse processing of artifact features. Because the proposed bi-directional attention module may not fully capture global, subtle forgery information in the artifact features (e.g., textures or edges), we further employ a fine-grained frequency attention module in the frequency domain. This module extracts high-frequency information from the fine-grained features, which contains the global and subtle forgery information. Although the features from each domain can be improved effectively and independently, fusing them directly does not effectively improve detection performance. Therefore, we propose a feature superposition strategy that makes the information from the spatial and frequency domains complementary. This strategy converts the feature components into wave-like tokens, which are updated based on their phase, so that the distinctions between authentic and artifact features are amplified. Our method demonstrates significant improvements over state-of-the-art (SOTA) methods on five public Deepfake datasets in capturing abnormalities across different manipulation operations and in real-life scenarios.
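To make the idea of phase-dependent superposition concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each token can be written as a complex number with an amplitude and a phase, and that fusion is a plain sum of the two waves. The function names `to_wave` and `superpose` are hypothetical, and the phases are fixed by hand here, whereas in a trained model they would presumably be learned.

```python
import numpy as np

def to_wave(amplitude, phase):
    """Represent a token as a complex wave: amplitude * exp(i * phase)."""
    return amplitude * np.exp(1j * phase)

def superpose(spatial_amp, spatial_phase, freq_amp, freq_phase):
    """Sum a spatial-domain wave and a frequency-domain wave (assumed fusion
    rule) and return the magnitude of the result."""
    wave = to_wave(spatial_amp, spatial_phase) + to_wave(freq_amp, freq_phase)
    return np.abs(wave)

# Four toy tokens with unit amplitude in both domains (illustrative values).
amp = np.ones(4)

# Equal phases: the two waves interfere constructively.
in_phase = superpose(amp, np.zeros(4), amp, np.zeros(4))

# Phases that differ by pi: the two waves interfere destructively.
out_phase = superpose(amp, np.zeros(4), amp, np.full(4, np.pi))

print(in_phase)   # magnitudes close to 2
print(out_phase)  # magnitudes close to 0
```

With identical amplitudes, the fused magnitude depends only on the phase difference, which illustrates how a phase-based update could enlarge the gap between tokens whose domain features agree and tokens whose domain features conflict.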
