WMamba: Wavelet-based Mamba for Face Forgery Detection

Siran Peng, Tianshuo Zhang, Li Gao, Xiangyu Zhu, Haoyuan Zhang, Kai Pang, Zhen Lei

TL;DR

WMamba is a wavelet-based face forgery detector built on the Mamba framework. It combines Dynamic Contour Convolution (DCConv) with a VMamba backbone to model slender facial contours and long-range spatial dependencies. The Hierarchical Wavelet Feature Extraction Branch (HWFEB) computes multi-level Haar DWT representations and DCConv-guided spatial attention maps, which are integrated into VMamba via spatial gating. Cross-dataset and cross-manipulation experiments show state-of-the-art generalization and robustness, and ablations confirm the contributions of HWFEB, DCConv, and VMamba. Because Mamba scales linearly with sequence length, the model can work on small image patches, which makes it relevant to real-world anti-fraud and misinformation mitigation.

Abstract

The rapid evolution of deepfake generation technologies necessitates the development of robust face forgery detection algorithms. Recent studies have demonstrated that wavelet analysis can enhance the generalization abilities of forgery detectors. Wavelets effectively capture key facial contours, often slender, fine-grained, and globally distributed, that may conceal subtle forgery artifacts imperceptible in the spatial domain. However, current wavelet-based approaches fail to fully exploit the distinctive properties of wavelet data, resulting in sub-optimal feature extraction and limited performance gains. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear complexity. This efficiency allows for the extraction of fine-grained, globally distributed forgery artifacts from small image patches. Extensive experiments show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness in face forgery detection.
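
To make the wavelet input concrete, here is a minimal sketch of one level of the 2D Haar DWT in plain Python. It is not the authors' implementation: the function name `haar_dwt2`, the orthonormal scaling, and the convention that LH holds row differences while HL holds column differences are assumptions made for illustration (sub-band naming varies between libraries).

```python
def haar_dwt2(image):
    """One level of the 2D Haar DWT (illustrative sketch, not the paper's code).

    `image` is a list of equal-length rows with even height and width.
    Returns (LL, LH, HL, HH), each at half the input resolution.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, w, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            # Orthonormal Haar: 1/sqrt(2) per axis, so 1/2 in 2D.
            row_ll.append((a + b + c + d) / 2)  # low-pass approximation
            row_lh.append((a + b - c - d) / 2)  # top row minus bottom row
            row_hl.append((a - b + c - d) / 2)  # left column minus right column
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


# A 2x2 patch containing a vertical edge: only LL and HL are non-zero.
ll, lh, hl, hh = haar_dwt2([[0, 1], [0, 1]])
# ll == [[1.0]], lh == [[0.0]], hl == [[-1.0]], hh == [[0.0]]
```

Applying the function again to the LL output yields the next level, which is the usual way to build a multi-level decomposition. Whether HWFEB recurses in exactly this way is not stated in this excerpt.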

Paper Structure

This paper contains 40 sections, 6 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Column 1: Example frames from the FaceForensics++ (FF++) dataset (Rossler et al., ICCV 2019). Column 2: Wavelets capture facial contours that are often slender, fine-grained, and globally distributed. Column 3: WMamba maximizes the potential of wavelet data via two innovations: DCConv for precise modeling of slender facial contours (green squares) and Mamba (Gu and Dao, 2023) for extracting fine-grained and globally distributed forgery artifacts (yellow arrows). Column 4: Saliency maps generated by Grad-CAM (Selvaraju et al., ICCV 2017) reveal that our model focuses on key facial contours, which are rich in forgery artifacts.
  • Figure 2: Row 1: Graphical comparison of different convolutional paradigms for capturing slender structures. Standard convolutions and DCN struggle with such structures, DSConv offers limited representation, while DCConv demonstrates superior capability. Row 2: Graphical comparison of global perception capabilities. Only two flattening directions of VMamba are visualized. CNNs lack global perception ability. Transformers capture global context but require splitting the input image into larger patches due to quadratic complexity. Mamba exhibits partial global perception, while VMamba demonstrates enhanced global perception capability.
  • Figure 3: Overview of WMamba. The architecture comprises two main components: HWFEB and VMamba. HWFEB employs multi-level DWT to capture wavelet representations, WFEMs to generate spatial attention maps, and spatial gating mechanisms to integrate these maps into VMamba. VMamba then extracts wavelet-enhanced forgery cues and performs classification.
  • Figure 4: Visualization of different frequency sub-bands from the DWT. The LH, HL, and HH sub-bands capture high-frequency details in various orientations, while the LL sub-band represents a low-resolution approximation of the original RGB image. The high-frequency sub-bands reveal critical forgery traces (highlighted in the red boxes) that are otherwise less apparent.
  • Figure 5: Schematic diagram of the VSS block, with the SS2D mechanism at its core. This mechanism flattens input image patches along four principal directions, facilitating comprehensive global perception. $L$ denotes the number of patches.
  • ...and 6 more figures
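
The four-direction flattening mentioned in the Figure 5 caption can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the VMamba code: the function name `cross_scan` is hypothetical, and the choice of row-major order, column-major order, and their two reverses follows the common description of SS2D rather than anything stated in this excerpt.

```python
def cross_scan(grid):
    """Flatten an H x W grid of patches into four 1D sequences.

    Assumed directions (illustrative): row-major, column-major,
    and the reverse of each. Every sequence has length L = H * W.
    """
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


# A 2 x 2 grid of patch indices, so L = 4.
sequences = cross_scan([[0, 1], [2, 3]])
# sequences == [[0, 1, 2, 3], [0, 2, 1, 3], [3, 2, 1, 0], [3, 1, 2, 0]]
```

Each sequence would then be processed by a 1D selective scan and the results merged back onto the grid. Scanning in several orders lets every patch receive context from all sides while the cost stays linear in $L$.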