Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Jipeng Liu, Haichao Shi, Siyu Xing, Rong Yin, Xiao-Yu Zhang

Abstract

While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.
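The Gradient Signal-to-Noise Ratio mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common per-parameter definition (squared mean of the per-sample gradient divided by its variance); the paper's layer-wise variant may differ in detail. The function name `gsnr`, the `eps` stabilizer, and the synthetic gradients are hypothetical and serve only as illustration.

```python
import numpy as np

def gsnr(per_sample_grads, eps=1e-12):
    """Per-parameter GSNR: squared mean gradient divided by gradient variance.

    per_sample_grads has shape (n_samples, n_params), one gradient row per sample.
    eps is an illustrative stabilizer (an assumption, not taken from the paper).
    """
    g = np.asarray(per_sample_grads, dtype=np.float64)
    mean = g.mean(axis=0)   # signal: gradient direction shared across samples
    var = g.var(axis=0)     # noise: sample-to-sample disagreement
    return mean ** 2 / (var + eps)

rng = np.random.default_rng(0)
n = 4096
# Hypothetical gradients: parameter 0 carries a consistent signal,
# parameter 1 is dominated by noise.
consistent = 1.0 + 0.1 * rng.standard_normal(n)
noisy = 0.01 + 1.0 * rng.standard_normal(n)
ratios = gsnr(np.stack([consistent, noisy], axis=1))
print(ratios)  # the first entry is far larger than the second
```

A parameter whose per-sample gradients agree has a high GSNR, while one whose gradients mostly cancel has a GSNR near zero. This is the sense in which the abstract treats attenuated GSNR as degraded generalization potential.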

Paper Structure (65 sections, 4 theorems, 52 equations, 11 figures, 14 tables)

Key Result

Lemma 1

Under local $L_t$-Lipschitz smoothness (the local curvature assumption), the geometric stability condition $\mathbb{E}[\langle \nabla \mathcal{L}_t, \tilde{\mathbf{g}}_t \rangle] > 0$ holds if:

Figures (11)

  • Figure 1: Empirical observation of Optimization Collapse. (a) Detection performance at the standard SAM radius ($\rho=0.05$). (b) Layer-wise evolution of the Critical Optimization Radius (COR). The red region denotes the "Collapse Zone" (COR $< 0.05$). (c) Representative samples from the evaluated datasets.
  • Figure 2: AUC curves of models trained with Sharpness-Aware Minimization (SAM) across varying perturbation radii $\rho$. (a) Empirical analysis of Optimization Collapse: diamond markers denote the Critical Optimization Radius (COR) for each dataset, and the red region highlights optimization failure at the standard radius ($\rho=0.05$). (b) AUC trends across the entire span of $\rho$. The gray dashed box marks the region magnified on the right, which shows AUC variations around the recommended radius ($\rho=0.05$).
  • Figure 3: GSNR trajectories during optimization. (a) GSNR values across four datasets at the 24th layer. (b) Layer-wise GSNR values on NeuralTextures. Diamond markers denote the bottleneck step $t^*$.
  • Figure 4: Overview of the proposed Contrastive Regional Injection Transformer (CoRIT) pipeline. Given an original image and its Self-Blended Image (SBI) counterpart, both are fed through the frozen CLIP image encoder in parallel. The discrepancy of visual tokens serves as the Contrastive Gradient Proxy (CGP). At each transformer layer, CoRIT applies three training-free components: (i) The Region Refinement Mask (RRM) clusters visual tokens around region anchors derived from the CGP, filtering out outlier tokens. (ii) The Regional Signal Injection (RSI) aggregates the refined tokens via average pooling and injects them into additional Region Tokens through intra-layer residual connections. (iii) The Hierarchical Representation Integration (HRI) concatenates the class token and region tokens from an intermediate layer $l_{\text{mid}}$ and the final layer $L$ for binary classification.
  • Figure 5: Training loss curves of compared methods illustrating generalization stability. Dashed lines indicate intra-dataset evaluation; solid lines indicate cross-dataset evaluation.
  • ...and 6 more figures
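The token flow described in the Figure 4 caption can be sketched roughly as follows. This is not the authors' implementation: the caption does not specify how anchors are chosen, which similarity measure is used, or how outliers are thresholded. The sketch therefore assumes anchors are the tokens with the largest CGP norm, cosine similarity for clustering, and a hypothetical `sim_threshold` for filtering. All function and parameter names are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def region_refine_and_inject(tokens_orig, tokens_sbi, region_tokens, sim_threshold=0.0):
    """One layer of the sketch. Tokens are (n_tokens, dim); region_tokens is (n_regions, dim)."""
    # CGP: token-wise discrepancy between the SBI and original branches.
    cgp = tokens_sbi - tokens_orig
    n_regions = region_tokens.shape[0]
    # ASSUMPTION: region anchors are the tokens with the largest CGP norm.
    anchor_idx = np.argsort(-np.linalg.norm(cgp, axis=1))[:n_regions]
    anchors = l2_normalize(tokens_sbi[anchor_idx])
    # RRM (assumed form): assign each token to its most similar anchor and
    # drop tokens whose best cosine similarity falls below the threshold.
    sim = l2_normalize(tokens_sbi) @ anchors.T      # (n_tokens, n_regions)
    assign = sim.argmax(axis=1)
    keep = sim.max(axis=1) >= sim_threshold
    # RSI: average-pool the kept tokens of each region and add the result to
    # the region token as a residual update.
    updated = region_tokens.copy()
    for r in range(n_regions):
        members = keep & (assign == r)
        if members.any():
            updated[r] = region_tokens[r] + tokens_sbi[members].mean(axis=0)
    return updated, keep

def hierarchical_features(cls_mid, regions_mid, cls_last, regions_last):
    """HRI: concatenate class and region tokens from an intermediate and the final layer."""
    return np.concatenate([cls_mid, regions_mid.ravel(), cls_last, regions_last.ravel()])

rng = np.random.default_rng(0)
n_tokens, dim, n_regions = 16, 8, 2
tokens_orig = rng.standard_normal((n_tokens, dim))
tokens_sbi = tokens_orig.copy()
tokens_sbi[:4] += 0.5 * rng.standard_normal((4, dim))   # hypothetical blended patch
region_tokens = np.zeros((n_regions, dim))
updated, keep = region_refine_and_inject(tokens_orig, tokens_sbi, region_tokens)
feat = hierarchical_features(np.zeros(dim), updated, np.zeros(dim), updated)
print(updated.shape, feat.shape)
```

In this sketch no step introduces trainable parameters, in line with the caption's description of the three components as training-free. The concatenated vector from `hierarchical_features` would then feed a binary classifier, which is omitted here.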

Theorems & Definitions (12)

  • Lemma 1: Local Stability Bound
  • proof
  • Definition 1: Critical Optimization Radius
  • Theorem 1: COR–GSNR Stability Decomposition
  • proof
  • Remark 1
  • Lemma 2: Well-Posedness of the Misspecification Term
  • proof
  • Remark 2: Behavior across optimization regimes
  • Corollary 1: COR Monotonicity and Collapse Equivalence
  • ...and 2 more
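The stability condition in Lemma 1, $\mathbb{E}[\langle \nabla \mathcal{L}_t, \tilde{\mathbf{g}}_t \rangle] > 0$, can be probed numerically on a toy problem. The sketch below assumes that $\tilde{\mathbf{g}}_t$ is the mini-batch SAM gradient, that is, the gradient evaluated at weights perturbed by $\rho$ along the normalized mini-batch gradient; this listing does not define the symbol. The least-squares model, data, and function names are hypothetical, and the script only estimates the inner product for a few radii rather than reproducing the paper's bound.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the loss 0.5 * mean((X w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def sam_gradient(w, X, y, rho, eps=1e-12):
    """SAM gradient: the gradient evaluated at w + rho * g / ||g||."""
    g = grad(w, X, y)
    w_adv = w + rho * g / (np.linalg.norm(g) + eps)
    return grad(w_adv, X, y)

def mean_alignment(w, X, y, rho, batch_size, n_batches, rng):
    """Monte Carlo estimate of E[<full gradient, mini-batch SAM gradient>]."""
    full = grad(w, X, y)
    vals = []
    for _ in range(n_batches):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        vals.append(full @ sam_gradient(w, X[idx], y[idx], rho))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(256)
w = np.zeros(5)

# Sweep a few perturbation radii; positive values satisfy the stability condition.
for rho in (0.0, 0.05, 0.5, 5.0):
    print(rho, mean_alignment(w, X, y, rho, batch_size=8, n_batches=200, rng=rng))
```

On a convex quadratic like this one the estimate is expected to stay positive, so the toy does not reproduce the collapse itself. It only shows how the quantity in Lemma 1 could be estimated for a given radius.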