
LOGER: Local--Global Ensemble for Robust Deepfake Detection in the Wild

Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo

Abstract

Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal--Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-$k$ aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
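The top-$k$ MIL aggregation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: patch logits and the pooling ratio are placeholders, with the 10% ratio taken from the top-10% pooling mentioned in the Figure 1 caption.

```python
import numpy as np

def topk_mil_pool(patch_logits, ratio=0.10):
    """Average only the top-k most suspicious patch logits, so that
    the many normal patches cannot dilute localized forgery evidence
    the way a plain global average would."""
    logits = np.asarray(patch_logits, dtype=np.float64)
    k = max(1, int(np.ceil(ratio * logits.size)))
    topk = np.sort(logits)[-k:]  # the k largest patch logits
    return float(topk.mean())

# Toy forged image: strong evidence in only 2 of 20 patches.
patches = [0.1] * 18 + [4.0, 5.0]
print(round(float(np.mean(patches)), 2))        # plain averaging dilutes: 0.54
print(round(topk_mil_pool(patches), 2))         # top-10% pooling keeps it: 4.5
```

In the full framework this image-level pooled score would receive the image label, while individual patch logits receive their own supervision, keeping local responses discriminative.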



Figures (3)

  • Figure 1: Overview of the proposed LOGER framework. Training data are sampled from a multi-source candidate pool with diverse degradation augmentation. The global branch performs full-image detection using DINOv3-H (M1, M2) and MetaCLIP2-H (M3) at multiple resolutions. The local branch employs DINOv3-L (M4, M5) with patch-level modeling and top-10% MIL pooling. All outputs are fused via logit-space averaging with test-time augmentation.
  • Figure 2: Robustness analysis under three degradation types: JPEG compression (left), spatial resizing (middle), and Gaussian blurring (right). AUC (%) is computed over 1,000 videos sampled from five cross-dataset benchmarks.
  • Figure 3: Representative failure cases on the NTIRE 2026 public test set. Top: false negatives (fake → real). Bottom: false positives (real → fake).
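The logit-space fusion shown in Figure 1 can be illustrated with a minimal sketch. The scalar per-model fake-logits below are assumed placeholders; in the actual framework, test-time-augmented views would simply contribute additional logits to the same mean.

```python
import numpy as np

def fuse_logits(model_logits):
    """Average ensemble members' logits before applying the sigmoid,
    rather than averaging their probabilities. Because the branches'
    errors are largely decorrelated, a confidently correct member can
    pull the fused logit past the decision boundary even when other
    members are mildly wrong."""
    z = float(np.mean(np.asarray(model_logits, dtype=np.float64)))
    return 1.0 / (1.0 + np.exp(-z))  # fused fake-probability

# Five hypothetical members (M1..M5) scoring one image:
p = fuse_logits([2.0, 1.5, -0.5, 3.0, 0.5])
print(p > 0.5)  # fused prediction: fake
```

Averaging in logit space is a mild design choice over probability averaging: probabilities saturate near 0 and 1, so probability-space means mute confident members, while logit-space means preserve their margin.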