Your Text Encoder Can Be An Object-Level Watermarking Controller
Naresh Kumar Devulapally, Mingzhen Huang, Vishal Asnani, Shruti Agarwal, Siwei Lyu, Vishnu Suresh Lokhande
TL;DR
The paper proposes an in-generation, object-level watermarking approach for AI-generated images produced by latent diffusion models. It adds a dedicated watermark token $\bm{\mathcal{W}_*}$ to the text encoder, which enables selective watermarking of image regions via cross-attention while preserving overall image quality. The method optimizes a dual loss at an empirically chosen timestep $\tau^*=8$ and reaches high bit accuracy (up to $99\%$ for $48$ bits) with a $10^5\times$ reduction in trainable parameters. It is plug-and-play compatible with Stable Diffusion variants and with Textual Inversion. Object-level localization, robustness to common attacks, and compatibility with personalized diffusion pipelines make the approach relevant to provenance, copyright protection, and controllable generation.
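To make the "train only one token" idea concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the table sizes, the function names `add_watermark_token` and `sgd_step_single_row`, and the placeholder gradients are all hypothetical, and a real pipeline would operate on the text encoder's embedding matrix rather than on nested lists. The point it illustrates is only that every pre-existing embedding stays frozen while the single new row is updated.

```python
# Toy illustration: a frozen embedding table plus one trainable watermark row.
# All names and sizes are hypothetical and are not taken from the paper.

def add_watermark_token(table, init_row):
    """Append one new embedding row for the watermark token; return its index."""
    table.append(list(init_row))
    return len(table) - 1

def sgd_step_single_row(table, grads, trainable_index, lr=0.1):
    """Apply a gradient step to the trainable row only; other rows stay frozen."""
    row = table[trainable_index]
    grad = grads[trainable_index]
    table[trainable_index] = [w - lr * g for w, g in zip(row, grad)]

# A 3-token vocabulary with 4-dimensional embeddings stands in for the text encoder.
vocab_table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
frozen_copy = [list(row) for row in vocab_table]

# Register the new watermark token with a zero-initialised embedding.
w_index = add_watermark_token(vocab_table, init_row=[0.0, 0.0, 0.0, 0.0])

# Placeholder gradients for every row; only the watermark row is actually updated.
fake_grads = [[1.0] * 4 for _ in vocab_table]
sgd_step_single_row(vocab_table, fake_grads, w_index, lr=0.1)

# The number of trainable values equals the embedding dimension of one token.
trainable_params = len(vocab_table[w_index])
```

In this sketch the trainable parameter count is one embedding vector, which is the kind of saving the summary refers to when it contrasts token-level tuning with fine-tuning a whole model.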
Abstract
Invisible watermarking of AI-generated images can help with copyright protection by enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermarking images generated by text-to-image (T2I) latent diffusion models (LDMs). By fine-tuning only the text token embeddings $W_*$, we enable watermarking of selected objects or parts of the image, which offers greater flexibility than traditional full-image watermarking. Because the text encoder is compatible across various LDMs, our method supports plug-and-play integration with different models. Moreover, introducing the watermark early, in the encoding stage, improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves $99\%$ bit accuracy ($48$ bits) with a $10^5 \times$ reduction in model parameters, enabling efficient watermarking.
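As a reading aid for the bit-accuracy figure, the sketch below computes the metric under its usual definition: the fraction of decoded bits that match the embedded message. The function name `bit_accuracy` and the example message are illustrative assumptions, not the paper's evaluation code.

```python
def bit_accuracy(embedded_bits, decoded_bits):
    """Return the fraction of positions where the decoded bit equals the embedded bit."""
    if len(embedded_bits) != len(decoded_bits):
        raise ValueError("bit strings must have the same length")
    if not embedded_bits:
        raise ValueError("bit strings must be non-empty")
    matches = sum(1 for a, b in zip(embedded_bits, decoded_bits) if a == b)
    return matches / len(embedded_bits)

# A hypothetical 48-bit message (alternating ones and zeros).
message = [1, 0] * 24

# Simulate a decoder that gets exactly one bit wrong.
decoded = list(message)
decoded[5] ^= 1

accuracy = bit_accuracy(message, decoded)  # 47 of 48 bits match
```

With a 48-bit message, a single flipped bit already lowers the accuracy to 47/48 (about 97.9%), so a figure such as 99% is presumably an average over many generated images rather than a per-image value.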
