DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi, Jan Butora, Vincent Itier, Jérémie Boulanger, Patrick Bas

TL;DR

DinoLizer addresses the challenge of localizing manipulated regions produced by generative inpainting by combining a frozen DINOv2 Vision Transformer backbone with a lightweight patch-wise classification head. It uses a sliding-window inference strategy to generate dense, pixel-level localization maps that remain robust under common post-processing, and adopts a bias-free training regime that treats auto-encoded regions as pristine. The method achieves state-of-the-art localization performance across multiple inpainting datasets, with strong robustness to noise, resizing, and JPEG compression, and generalizes better than end-to-end fine-tuned backbones. This approach highlights the effectiveness of patch-token-based localization on ViTs for forensic detection and suggests strong potential for scalable localization with future ViT architectures.

Abstract

We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
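
The patch-level head lends itself to a short illustration. The sketch below is not the authors' code: it assumes the $504\times 504$ crop size given in the Figure 1 caption, a toy embedding dimension of 2, and hypothetical names (`grid_side`, `linear_head`) and weights. It only shows that one shared linear layer applied to every patch token yields one logit per $14\times 14$ patch, which is the same computation as a $1\times 1$ convolution over the token grid.

```python
import math

PATCH = 14   # patch size in pixels, as stated in the abstract
CROP = 504   # crop size in pixels, taken from the Figure 1 caption


def grid_side(crop=CROP, patch=PATCH):
    """Number of patches along one side of a square crop."""
    return crop // patch


def linear_head(tokens, weight, bias):
    """Apply one shared linear layer to each patch token.

    tokens: list of patch embeddings, each a list of d floats
    weight: list of d floats (hypothetical trained weights)
    bias:   float
    Returns one logit per patch token.
    """
    return [sum(w * x for w, x in zip(weight, tok)) + bias for tok in tokens]


def sigmoid(z):
    """Map a logit to a manipulation probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Toy usage with d = 2 and two patch tokens (illustrative values only).
logits = linear_head([[1.0, 2.0], [0.0, 0.0]], weight=[0.5, -1.0], bias=0.25)
probs = [sigmoid(z) for z in logits]
```

Under these assumptions a single crop produces a $36\times 36$ grid of logits, so each prediction covers a $14\times 14$ pixel block of the input.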

Paper Structure

This paper contains 20 sections, 7 equations, 16 figures, and 5 tables.

Figures (16)

  • Figure 1: Principle of DinoLizer: the image is decomposed into a set of overlapping $504\times 504$ crops, which are fed to the DINOv2 model to produce an embedding of dimension $d$ for each patch, plus a class token that is not used here. A trainable $1 \times 1$ convolutional layer infers a logit map for each crop, which is then fused with the maps of the overlapping crops to produce, after thresholding, a localization mask.
  • Figure 2: Visual comparison of forgery localization results on the CocoGlide dataset.
  • Figure 3: Visual comparison of forgery localization results on the TGIF dataset (top) and their JPEG QF 80 robustness (bottom).
  • Figure 4: F1 score performance under different types of perturbations on CocoGlide: (a) Gaussian noise, (b) resizing, (c) JPEG compression, and (d) double JPEG compression.
  • Figure 5: F1 score performance under different types of perturbations on SAGI-SP: (a) Gaussian noise, (b) resizing, (c) JPEG compression, and (d) double JPEG compression.
  • ...and 11 more figures
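
The crop-and-fuse inference described in the Figure 1 caption can be sketched as follows. The caption does not give the stride or the fusion rule, so the averaging of overlapping logits, the zero threshold, the border handling, and all names here (`window_starts`, `fuse_logit_maps`, `threshold`, the `predict` callback) are assumptions made for illustration rather than the paper's procedure. The sketch is dependency-free and works at whatever resolution the per-window maps are expressed in.

```python
def window_starts(length, window, stride):
    """Start offsets of windows of size `window` that cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # extra window flush with the border
    return starts


def fuse_logit_maps(height, width, window, stride, predict):
    """Average overlapping per-window logit maps into one full-size map.

    predict(top, left) is a hypothetical callback standing in for the
    backbone plus head; it must return a window x window list of logits
    for the crop whose top-left corner is (top, left).
    Assumes height >= window and width >= window.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            logits = predict(top, left)
            for i in range(window):
                for j in range(window):
                    total[top + i][left + j] += logits[i][j]
                    count[top + i][left + j] += 1
    # Every position is covered at least once, so count is never zero.
    return [[t / c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(total, count)]


def threshold(logit_map, tau=0.0):
    """Binarize a fused logit map; tau = 0 corresponds to probability 0.5."""
    return [[1 if v > tau else 0 for v in row] for row in logit_map]


# Toy usage: a 4 x 6 map, 4 x 4 windows, stride 2, and a dummy predictor
# whose logits equal the window's left offset (illustrative values only).
fused = fuse_logit_maps(4, 6, 4, 2,
                        lambda top, left: [[float(left)] * 4 for _ in range(4)])
mask = threshold(fused, tau=0.5)
```

Averaging is only one plausible way to fuse the maps; taking the maximum or weighting window centers more heavily would fit the same structure by changing how `total` and `count` are accumulated.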