TokenPure: Watermark Removal through Tokenized Appearance and Structural Guidance
Pei Yang, Yepeng Liu, Kelly Peng, Yuan Gao, Yiren Song
TL;DR
TokenPure reframes watermark removal as conditional image generation guided by token-level appearance and layout priors. It uses a dual-branch Diffusion Transformer framework to remove watermark signals without relying on the original watermark-carrying noise. An Appearance Adapter and a Layout Controller provide complementary visual and geometric conditioning, which is fused through multimodal attention, and a Joint Consistency Reward enforces pixel-level coherence. Empirical results show state-of-the-art removal efficacy and reconstruction fidelity across diverse watermark types; ablations confirm the contribution of each component, and a user study confirms that the outputs are perceptually preferred. The work advances watermark robustness testing and offers a practical, controllable removal method that preserves content integrity.
Abstract
In the digital economy, digital watermarking is a critical basis for proving ownership of large volumes of replicable content, including AI-generated content and other virtual assets. Designing robust watermarks that withstand various attacks and processing operations is therefore all the more important. We introduce TokenPure, a novel Diffusion Transformer-based framework for effective and consistent watermark removal. TokenPure addresses the trade-off between thorough watermark destruction and content consistency through token-based conditional reconstruction. It reframes the task as conditional generation, entirely bypassing the initial watermark-carrying noise. We achieve this by decomposing the watermarked image into two complementary token sets: visual tokens for texture and structural tokens for geometry. These tokens jointly condition the diffusion process, enabling the framework to synthesize watermark-free images with fine-grained consistency and structural integrity. Comprehensive experiments show that TokenPure achieves state-of-the-art watermark removal and reconstruction fidelity, substantially outperforming existing baselines in both perceptual quality and consistency.
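
As a rough illustration of the conditioning idea described in the abstract, the sketch below shows one way two token sets could jointly condition a generation step that starts from fresh noise. It is a toy under stated assumptions, not the authors' implementation: the function names (patch_tokens, edge_map, joint_attention, conditioned_step) are hypothetical, raw pixel patches stand in for visual tokens, a simple gradient map stands in for structural tokens, and a single attention pass without learned projections stands in for the Diffusion Transformer.

```python
import math
import random

def patch_tokens(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch blocks, each flattened."""
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tokens.append([image[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

def edge_map(image):
    """Toy structural cue (assumption): sum of absolute forward differences."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dx = image[r][min(c + 1, w - 1)] - image[r][c]
            dy = image[min(r + 1, h - 1)][c] - image[r][c]
            out[r][c] = abs(dx) + abs(dy)
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(queries, context):
    """Single-head dot-product attention: each query attends over all context tokens."""
    d = len(context[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(wt * v[i] for wt, v in zip(weights, context))
                    for i in range(d)])
    return out

def conditioned_step(watermarked, patch=2, seed=0):
    """One toy conditioning pass: noise tokens attend over noise + both token sets."""
    rng = random.Random(seed)
    visual = patch_tokens(watermarked, patch)                # texture cue
    structural = patch_tokens(edge_map(watermarked), patch)  # geometry cue
    # Generation starts from fresh Gaussian noise rather than from noise
    # derived from the watermarked image.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(patch * patch)] for _ in visual]
    context = noise + visual + structural
    return joint_attention(noise, context)
```

Calling conditioned_step on a 4 x 4 image with patch=2 returns four updated tokens of length four. In a real system each token set would pass through learned encoders and many denoising steps; the sketch only shows the data flow in which the watermarked image contributes conditioning tokens but not the starting noise.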
