TokenPure: Watermark Removal through Tokenized Appearance and Structural Guidance

Pei Yang, Yepeng Liu, Kelly Peng, Yuan Gao, Yiren Song

TL;DR

TokenPure reframes watermark removal as conditional image generation guided by token-level appearance and layout priors, using a dual-branch diffusion-transformer framework to remove watermark signals without relying on the original watermark noise. The Appearance Adapter and Layout Controller provide complementary visual and geometric conditioning, fused through multimodal attention and a Joint Consistency Reward to enforce pixel-level coherence. Empirical results show SOTA removal efficacy and reconstruction fidelity across diverse watermark types, with ablations confirming the value of each component and a user study confirming perceptual superiority. The work advances watermark robustness testing and offers a practical, controllable method for watermark removal that preserves content integrity.
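
As a rough illustration of how a pixel-level consistency reward could be computed from a single sampling step, the sketch below estimates the clean sample in one step and scores it against a reference. It assumes a rectified-flow parameterization (x_t = (1 - t) * x0 + t * eps, velocity v = eps - x0) and works directly on pixel arrays with no VAE decoding. The function names, the MSE choice, and the parameterization are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def one_step_estimate(x_t, v_pred, t):
    """Estimate the clean sample from a noisy one in a single step.

    Assumes x_t = (1 - t) * x0 + t * eps and v = eps - x0,
    which gives x0 = x_t - t * v.
    """
    return x_t - t * v_pred

def pixel_reward_loss(x0_hat, reference):
    """Mean squared pixel error; lower means closer to the reference."""
    return float(np.mean((x0_hat - reference) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(3, 8, 8))        # stand-in for the reference image
eps = rng.normal(size=x0.shape)         # Gaussian noise
t = 0.7
x_t = (1 - t) * x0 + t * eps            # noisy sample at time t

perfect_v = eps - x0                    # ideal velocity prediction
noisy_v = perfect_v + 0.1 * rng.normal(size=x0.shape)  # imperfect prediction

loss_perfect = pixel_reward_loss(one_step_estimate(x_t, perfect_v, t), x0)
loss_noisy = pixel_reward_loss(one_step_estimate(x_t, noisy_v, t), x0)
```

With an ideal velocity prediction the one-step estimate reproduces the reference and the loss vanishes up to floating-point error; any prediction error shows up directly as pixel-level loss, which is the signal such a reward would penalize.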

Abstract

In the digital economy era, digital watermarking serves as a critical basis for proving ownership of massively replicable content, including AI-generated and other virtual assets. Designing robust watermarks that withstand diverse attacks and processing operations is therefore paramount. We introduce TokenPure, a novel Diffusion Transformer-based framework designed for effective and consistent watermark removal. TokenPure addresses the trade-off between thorough watermark destruction and content consistency by leveraging token-based conditional reconstruction. It reframes the task as conditional generation, entirely bypassing the initial watermark-carrying noise. We achieve this by decomposing the watermarked image into two complementary token sets: visual tokens for texture and structural tokens for geometry. These tokens jointly condition the diffusion process, enabling the framework to synthesize watermark-free images with fine-grained consistency and structural integrity. Comprehensive experiments show that TokenPure achieves state-of-the-art watermark removal and reconstruction fidelity, substantially outperforming existing baselines in both perceptual quality and consistency.
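
To make the idea of a geometry-only conditioning signal concrete, here is a toy sketch that extracts an edge map from a grayscale image. The Figure 1 caption mentions edge detection for the layout branch but does not name a detector, so the Sobel operator, the synthetic step image, and the small additive perturbation are all illustrative assumptions rather than the paper's pipeline or a real watermark.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, using edge padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def edge_map(gray):
    """Sobel gradient magnitude: keeps coarse geometry, drops flat regions."""
    gx = filter3x3(gray, SOBEL_X)
    gy = filter3x3(gray, SOBEL_Y)
    return np.hypot(gx, gy)

# Synthetic image: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Hypothetical low-amplitude perturbation (not a real watermarking scheme).
rng = np.random.default_rng(0)
perturbed = img + 0.01 * rng.standard_normal(img.shape)

edges_clean = edge_map(img)
edges_perturbed = edge_map(perturbed)
```

In this toy case the edge map responds only along the vertical boundary, and the low-amplitude perturbation changes it very little, which is the intuition behind conditioning on structure separately from texture.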

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the TokenPure framework. It takes as inputs text, noise, the layout representation $I_{\text{lay}}$ (from edge detection and VAE encoding) of the watermarked image $I_{\text{wm}}$, and image prompt tokens (from the Image Encoder and Projection of $I_{\text{wm}}$). Within the DiT Block, the Self-Attention and Feed Forward modules process these tokens, while the Layout Control & Appearance Adapter branch integrates spatial structure and visual details. The Layout Controller uses a LoRA module to optimize the QKV projection of the layout tokens, and the Appearance Adapter performs a trainable KV projection on the image prompt tokens, which interact with the Q projection of the text and noise tokens via multimodal attention. Finally, the Joint Consistency Reward mechanism refines the result through one-step sampling and a pixel-level reward loss, ensuring the reconstructed image’s consistency with the original.
  • Figure 2: Performance of TokenPure compared with other methods under varying noise strengths across different watermarks. We invert the LPIPS/FID scores so that the top-left corner represents better performance in all plots.
  • Figure 3: Qualitative comparison of different watermark removal attacks on different watermarking methods. It shows that our TokenPure preserves high visual consistency and quality under various watermarks.
  • Figure 4: User study results. TokenPure is preferred by human voters over the other baselines. The results show the advantages of TokenPure in both image consistency and degradation control.
  • Figure 5: Performance of TokenPure compared with other methods under varying noise strengths across different watermarks. We invert the LPIPS/FID scores so that the top-left corner represents better performance in all plots.
  • ...and 1 more figure
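
The conditioning path described in the Figure 1 caption can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the authors' implementation. It assumes the layout tokens join the text and noise tokens in one joint attention pass through LoRA-adapted QKV projections, and that the Appearance Adapter adds a separate attention term in which the text and noise queries attend to trainable K/V projections of the image prompt tokens (an IP-Adapter-style decoupled design). All names (`LoRALinear`, `conditioned_attention`, `scale`) and dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

class LoRALinear:
    """Frozen weight plus a trainable low-rank update (up @ down)."""
    def __init__(self, weight, rank, rng):
        d_out, d_in = weight.shape
        self.weight = weight                            # frozen base projection
        self.down = rng.normal(scale=0.02, size=(rank, d_in))
        self.up = np.zeros((d_out, rank))               # zero init: no change at start

    def __call__(self, x):
        return x @ self.weight.T + (x @ self.down.T) @ self.up.T

def conditioned_attention(tokens, layout, image_prompt, p, scale=1.0):
    # Text + noise tokens use the frozen base projections.
    q, k, v = tokens @ p["wq"].T, tokens @ p["wk"].T, tokens @ p["wv"].T
    # Layout tokens use LoRA-adapted projections and join the same sequence.
    q_lay, k_lay, v_lay = (p[n](layout) for n in ("lora_q", "lora_k", "lora_v"))
    out = attention(np.concatenate([q, q_lay]),
                    np.concatenate([k, k_lay]),
                    np.concatenate([v, v_lay]))
    # Appearance branch: trainable K/V on image prompt tokens,
    # queried by the text + noise Q only.
    k_img = image_prompt @ p["wk_img"].T
    v_img = image_prompt @ p["wv_img"].T
    n = tokens.shape[0]
    out[:n] += scale * attention(q, k_img, v_img)
    return out

rng = np.random.default_rng(0)
d = 8
params = {name: rng.normal(scale=d ** -0.5, size=(d, d))
          for name in ("wq", "wk", "wv", "wk_img", "wv_img")}
for name in ("q", "k", "v"):
    params["lora_" + name] = LoRALinear(params["w" + name], rank=2, rng=rng)

tokens = rng.normal(size=(5, d))        # text + noise tokens
layout = rng.normal(size=(4, d))        # layout tokens (edge map -> VAE in the paper)
image_prompt = rng.normal(size=(3, d))  # projected image-encoder tokens

out = conditioned_attention(tokens, layout, image_prompt, params)
baseline = conditioned_attention(tokens, layout, image_prompt, params, scale=0.0)
```

Setting `scale=0.0` switches off the appearance term, which makes the two conditioning paths easy to ablate separately in this sketch; the zero-initialized LoRA `up` matrix means the layout projections start out identical to the frozen base projections.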