Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

Yuan-Chih Chen, Chun-Shien Lu

TL;DR

A unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. It encodes semantic and perceptual information into a compact hidden-code representation refined through multi-scale vector quantization, and it enhances contextual reasoning via conditional Transformer modules.

Abstract

Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.

Paper Structure (36 sections, 20 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Overall workflow of the proposed method. In the multi-scale token watermarking stage, our goal is to protect an image $I$ by embedding image content–related information to produce a watermarked image $I_w$. The original image $I$ is first converted into quantized multi-scale tokens $z_{s_1}, \ldots, z_{s_k}$ using a VQ-VAE. These tokens are then flattened and transformed into a bit string $h$. The bit string $h$ is embedded into $I$ as a content watermark through the watermark injection encoder $E_w$, where the embedding length is constrained by the watermark capacity $|m|$ such that $|h| \leq |m|$. When a tampered image $I_d$ is generated from the watermarked image $I_w$ by malicious manipulations (e.g., object removal or inpainting), our method enables deepfake recovery using the content-related information embedded in $I_w$. The tampered image $I_d$ is decoded by the watermark decoder $D_w$ to extract the hidden quantized multi-scale tokens $h'$ and by the localization decoder $D_{loc}$ to produce a deepfake localization map $M_{loc}$. Finally, the multi-scale recovery transformer reconstructs the tampered image $I_d$ based on $h'$, $M_{loc}$, and the quantized tokens $h_d$ of the tampered image.
  • Figure 2: Visualization of the dropout effect during quantizer (VQ-VAE) training. From left to right, each column shows the reconstructed images using token maps $(z_{s_1}, \ldots, z_{s_k})$ with $1\leq k\leq 10$. Dropout encourages semantic robustness across scales, allowing lower-level representations (smaller $k$) to retain more global structure and meaningful content. In contrast, the model trained without dropout produces coarse, less informative reconstructions at lower scales.
  • Figure 3: Illustration of the proposed conditional deepfake recovery Transformer. The predicted hidden codes $h'$ are used as conditional tokens to guide the generation process, enabling context-aware reconstruction. The untampered regions $\tilde{z}_{s_i}$ of the deepfaked image provide additional contextual cues, helping the model produce more detailed and consistent predictions at higher scales.
  • Figure 4: Qualitative comparison of tampered image recovery performance. The first two columns show the Original Image and its Tampering mask. The third column, Deepfaked Image, shows the tampered image generated using Stable Diffusion based on the mask. General recovery methods (HINet [jing2021hinet], RePaint [lugmayr2022repaint]) often fail to maintain semantic consistency or introduce severe distortions, and general quantization methods (VQGAN, VAR) combined with our watermarking scheme struggle with texture preservation. In contrast, our method successfully reconstructs the masked region, restoring the identity and realistic appearance of the original subjects (birds and an ostrich).
  • Figure 5: Illustration of single-scale versus multi-scale quantization. (a) Single-scale quantization: each token $x_i$ corresponds to a specific spatial position in the image, and the transformer predicts $x_{i}$ sequentially based on all preceding tokens $(x_1,x_2, \ldots, x_{i-1})$. (b) Multi-scale quantization: each token map $z_{s_i}$ represents a group of tokens at a particular scale $s_i$. The transformer predicts $z_{s_i}$ conditioned on all previous token maps $(z_{s_1},z_{s_2}, \ldots, z_{s_{i-1}})$.
  • ...and 1 more figure
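
The Figure 1 caption describes flattening the quantized multi-scale tokens $z_{s_1}, \ldots, z_{s_k}$ into a bit string $h$ whose length must satisfy $|h| \leq |m|$. The sketch below illustrates one plausible way to do this packing and capacity check; it is not the authors' implementation. The fixed-width binary encoding, the square token maps, the codebook size of 16, and the function names `tokens_to_bits`, `bits_to_tokens`, and `max_scales_within_capacity` are all assumptions made for illustration.

```python
def bits_per_token(codebook_size):
    # Fixed-width encoding (an assumption): enough bits to index the codebook.
    return max(1, (codebook_size - 1).bit_length())


def tokens_to_bits(token_maps, codebook_size):
    """Flatten multi-scale token maps (coarse to fine) into one bit string h."""
    width = bits_per_token(codebook_size)
    flat = [t for token_map in token_maps for row in token_map for t in row]
    return "".join(format(t, f"0{width}b") for t in flat)


def bits_to_tokens(h, scales, codebook_size):
    """Inverse of tokens_to_bits: rebuild one s-by-s token map per scale s."""
    width = bits_per_token(codebook_size)
    ids = [int(h[i:i + width], 2) for i in range(0, len(h), width)]
    maps, pos = [], 0
    for s in scales:
        chunk = ids[pos:pos + s * s]
        maps.append([chunk[r * s:(r + 1) * s] for r in range(s)])
        pos += s * s
    return maps


def max_scales_within_capacity(scales, codebook_size, capacity_bits):
    """Largest k such that the tokens of scales s_1..s_k fit in |m| bits."""
    width = bits_per_token(codebook_size)
    used, k = 0, 0
    for s in scales:
        used += s * s * width
        if used > capacity_bits:
            break
        k += 1
    return k


# Hypothetical example: three scales (1x1, 2x2, 3x3) and a 16-entry codebook.
scales = [1, 2, 3]
token_maps = [
    [[5]],
    [[1, 2], [3, 4]],
    [[0, 15, 7], [8, 9, 10], [11, 12, 13]],
]
h = tokens_to_bits(token_maps, codebook_size=16)
recovered = bits_to_tokens(h, scales, codebook_size=16)
k_small = max_scales_within_capacity(scales, 16, capacity_bits=24)
k_large = max_scales_within_capacity(scales, 16, capacity_bits=64)
```

Under these assumptions, a smaller watermark capacity $|m|$ simply truncates the hidden code to the coarsest scales, which is consistent with the Figure 2 caption's emphasis on making low-scale tokens carry meaningful global structure.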