ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection
Mohammad Romani
TL;DR
Deepfakes increasingly evade single-branch detectors, which motivates robust, multi-domain forensic analysis. ForensicFlow integrates three specialized streams (RGB-Spatial, Texture-Microscopic, and Frequency Analysis) with temporal attention and adaptive fusion to jointly exploit appearance, texture, and spectral cues. Trained with progressive unfreezing and Focal Loss, it achieves an AUC of 0.9752 and an F1 of 0.9408 on Celeb-DF (v2), while Grad-CAM analyses indicate that the model focuses on genuine manipulation regions. The work demonstrates that cross-domain fusion, temporal prioritization, and interpretable signals can substantially improve resilience to evolving deepfake techniques and support practical forensic deployment.
Abstract
Modern deepfakes evade detection by leaving subtle, domain-specific artifacts that single-branch networks miss. ForensicFlow addresses this by fusing evidence across three forensic dimensions: global visual inconsistencies (via ConvNeXt-tiny), fine-grained texture anomalies (via Swin Transformer-tiny), and spectral noise patterns (via a CNN with channel attention). Our attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive fusion weights each branch according to forgery type. Trained on Celeb-DF (v2) with Focal Loss, the model achieves AUC 0.9752, F1 0.9408, and accuracy 0.9208, outperforming single-stream detectors. Ablation studies confirm branch synergy, and Grad-CAM visualizations validate focus on genuine manipulation regions (e.g., facial boundaries). This multi-domain fusion strategy establishes robustness against increasingly sophisticated forgeries.
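To make the three mechanisms named above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that temporal pooling is a softmax-weighted average of per-frame features, that adaptive fusion is a softmax-gated convex combination of equal-sized branch vectors, and that the loss is the standard binary Focal Loss. The function names, the `alpha=0.25` and `gamma=2.0` defaults, and the treatment of frame scores and gate logits as given inputs (in a real model they would come from learned layers) are all illustrative assumptions, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_feats, frame_scores):
    """Pool per-frame feature vectors into one clip-level vector.

    Weights are softmax(frame_scores), so frames with higher evidence
    scores contribute more. The scores are assumed to be given here.
    """
    weights = softmax(frame_scores)
    dim = len(frame_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats))
            for d in range(dim)]


def adaptive_fusion(branch_feats, gate_logits):
    """Fuse branch vectors of equal size with softmax gate weights.

    The gate logits are assumed inputs (hypothetical); a learned gating
    layer would normally produce them from the branch features.
    """
    gates = softmax(gate_logits)
    dim = len(branch_feats[0])
    return [sum(g * b[d] for g, b in zip(gates, branch_feats))
            for d in range(dim)]


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary Focal Loss for a predicted fake-probability p and label y.

    alpha and gamma are assumed defaults, not values from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Three frames with 2-D features; the last frame has the highest score.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    clip = attention_pool(frames, [0.1, 0.2, 2.0])

    # Three branch vectors (RGB, texture, frequency) fused with equal gates.
    fused = adaptive_fusion([clip, clip, clip], [0.0, 0.0, 0.0])

    print(clip, fused, focal_loss(0.9, 1), focal_loss(0.6, 1))
```

In this formulation the factor `(1 - p_t) ** gamma` shrinks the loss on confidently correct predictions, so training concentrates on hard examples. With `gamma=0` the expression reduces to class-weighted cross-entropy.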
