Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection

Tianxiang Zhang, Peipeng Yu, Zhihua Xia, Longchen Dai, Xiaoyu Zhou, Hui Gao

TL;DR

This work tackles the limited generalization of current deepfake detectors by tuning DINOv2 with a DeepFake Fine-Grained Adapter (DFF-Adapter) that injects task-specific and shared low-rank adapters into every Transformer block while keeping the backbone frozen. It introduces a Forgery-Aware Multi-Head Router, which routes subspace features to specialized LoRA experts, and a Shared-Enhanced Task Fusion module, which transfers fine-grained forgery cues to the authenticity task. The approach achieves state-of-the-art or competitive results on intra-dataset, cross-dataset, and cross-manipulation benchmarks with only about 3.5M trainable parameters. These results indicate strong generalization and practical potential for robust, efficient face forgery detection in real-world security settings.

Abstract

The proliferation of sophisticated deepfakes poses significant threats to information integrity. While DINOv2 shows promise for detection, existing fine-tuning approaches treat it as generic binary classification, overlooking distinct artifacts inherent to different deepfake methods. To address this, we propose a DeepFake Fine-Grained Adapter (DFF-Adapter) for DINOv2. Our method incorporates lightweight multi-head LoRA modules into every transformer block, enabling efficient backbone adaptation. DFF-Adapter simultaneously addresses authenticity detection and fine-grained manipulation type classification, where classifying forgery methods enhances artifact sensitivity. We introduce a shared branch propagating fine-grained manipulation cues to the authenticity head. This enables multi-task cooperative optimization, explicitly enhancing authenticity discrimination with manipulation-specific knowledge. Utilizing only 3.5M trainable parameters, our parameter-efficient approach achieves detection accuracy comparable to or even surpassing that of current complex state-of-the-art methods.
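
The dual supervision described above can be illustrated with a minimal sketch, not the authors' implementation. It combines a binary cross-entropy term for the authenticity head with a multi-class cross-entropy term for the forgery-type head, matching the losses named $L_{\text{bce}}$ and $L_{\text{ftc}}$ in the Figure 2 caption. The weighting factor `lam`, the label convention (1 = fake), and all function names are assumptions made for illustration; the summary does not state how the two losses are balanced.

```python
import math

def bce_with_logit(logit, label):
    """Binary cross-entropy on one raw logit (assumed convention: 1 = fake, 0 = real).

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def cross_entropy(logits, target):
    """Multi-class cross-entropy via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def dual_loss(auth_logit, auth_label, type_logits, type_label, lam=1.0):
    """Authenticity loss plus weighted forgery-type loss.

    lam is a hypothetical balancing weight, not a value taken from the paper.
    """
    l_bce = bce_with_logit(auth_logit, auth_label)
    l_ftc = cross_entropy(type_logits, type_label)
    return l_bce + lam * l_ftc

# Example: one fake sample whose manipulation type is class 2 out of 5.
loss = dual_loss(auth_logit=1.2, auth_label=1,
                 type_logits=[0.1, -0.3, 2.0, 0.0, -1.0], type_label=2)
```

At inference time only the authenticity score would be needed, since the forgery-type head serves as an auxiliary training signal.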

Paper Structure

This paper contains 33 sections, 10 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Training and Inference Stages. During training, the frozen DINOv2 backbone with the DFF-Adapter is augmented by three adapter heads: authenticity, forgery-type, and shared. The authenticity and forgery-type branches are jointly optimized, while the shared branch captures fine-grained forgery cues and transfers them to the authenticity stream. During inference, only the fused authenticity and shared branches are used for face forgery detection.
  • Figure 2: Our framework augments a frozen DINOv2 backbone with a DFF-Adapter in each Transformer block. Each adapter contains three low-rank heads (authenticity, forgery-type, and shared), whose multi-head routers select the top-3 LoRA experts per feature subspace. The shared head transfers fine-grained cues to the authenticity stream. During training, the authenticity and forgery-type CLS tokens are supervised by a binary cross-entropy loss $L_{\text{bce}}$ and a multi-class loss $L_{\text{ftc}}$, respectively.
  • Figure 3: t-SNE visualizations on two datasets: top row shows intra-dataset (FF++), bottom row shows cross-dataset (CDF-v2).
  • Figure 4: Grad-CAM visualizations on the CDF2, DFDC, and DFDCP datasets. Our method produces more focused and interpretable attention maps compared to the DINOv2 baseline.
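
The routing step mentioned in the Figure 2 caption (multi-head routers selecting the top-3 LoRA experts per feature subspace) can be sketched as follows. This is a simplified, pure-Python illustration under several assumptions that the captions do not confirm: the token feature is split evenly into `num_heads` subspaces, each subspace has a linear router, the selected scores are softmax-normalized, and experts are shared across subspaces. All names are hypothetical.

```python
import math

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def top_k_gates(scores, k=3):
    """Keep the k largest router scores and softmax-normalize them."""
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in idx)
    exps = [math.exp(scores[i] - m) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]

def routed_lora_update(x, routers, experts, num_heads, k=3):
    """Low-rank update for one token; each subspace is routed to k experts.

    routers[h] : (num_experts x d_sub) scoring matrix for subspace h (assumed linear).
    experts[e] : (down, up) pair with down of shape (r x d_sub) and up (d_sub x r).
    """
    d_sub = len(x) // num_heads
    out = []
    for h in range(num_heads):
        sub = x[h * d_sub:(h + 1) * d_sub]
        scores = matvec(routers[h], sub)  # one score per expert
        delta = [0.0] * d_sub
        for e, gate in top_k_gates(scores, k):
            down, up = experts[e]
            update = matvec(up, matvec(down, sub))  # LoRA path: up @ (down @ sub)
            delta = [d + gate * u for d, u in zip(delta, update)]
        out.extend(delta)
    return out

# Toy setup: 4-dim token, 2 subspaces, 4 rank-1 experts.
x = [0.5, -1.0, 2.0, 0.25]
routers = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]] for _ in range(2)]
experts = [([[1.0, 1.0]], [[1.0], [1.0]]) for _ in range(4)]
delta = routed_lora_update(x, routers, experts, num_heads=2)
```

In an adapter of this kind, the returned update would be added to the output of the frozen layer, so only the router and expert matrices require gradients.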