Safe-VAR: Safe Visual Autoregressive Model for Text-to-Image Generative Watermarking

Ziyi Wang, Songbai Tan, Gang Xu, Xuerui Qiu, Hongbin Xu, Xin Meng, Ming Li, Fei Richard Yu

TL;DR

Safe-VAR addresses the lack of watermarking for autoregressive text-to-image generation by introducing an Adaptive Scale Interaction Module (ASIM), a Cross-Scale Fusion Module (CSFM) with mixture-of-heads (MoH) and mixture-of-experts (MoE) routing, and a Fusion Attention Enhancement Module (FAEM) to embed robust, imperceptible watermarks within multi-scale VAR tokens. The method dynamically selects embedding scales, fuses cross-scale features, and refines them with attention, achieving state-of-the-art image quality, watermark fidelity, and robustness across diverse datasets and high resolutions, including zero-shot QR Code scenarios. Experiments and ablations indicate that each component contributes to the result, and the method generalizes well to unseen domains and perturbations. The work offers a practical, efficient pathway for copyright protection in AR-based generative content.

Abstract

With the success of autoregressive learning in large language models, autoregressive modeling has become a dominant approach for text-to-image generation, offering high efficiency and visual quality. However, invisible watermarking for visual autoregressive (VAR) models remains underexplored, despite its importance in misuse prevention. Existing watermarking methods, designed for diffusion models, often struggle to adapt to the sequential nature of VAR models. To bridge this gap, we propose Safe-VAR, the first watermarking framework specifically designed for autoregressive text-to-image generation. Our study reveals that the timing of watermark injection significantly impacts generation quality, and that watermarks of different complexities have different optimal injection times. Motivated by this observation, we propose an Adaptive Scale Interaction Module, which dynamically determines the optimal watermark embedding strategy based on the watermark information and the visual characteristics of the generated image. This ensures watermark robustness while minimizing its impact on image quality. Furthermore, we introduce a Cross-Scale Fusion mechanism, which integrates a mixture of heads and a mixture of experts to effectively fuse multi-resolution features and handle complex interactions between image content and watermark patterns. Experimental results demonstrate that Safe-VAR achieves state-of-the-art performance, significantly surpassing existing counterparts in image quality, watermarking fidelity, and robustness against perturbations. Moreover, our method exhibits strong generalization to an out-of-domain watermark dataset, QR Codes.
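
To make the idea of adaptive scale selection more concrete, the following is a minimal Python sketch of a top-$k$ selector with normalized weights, loosely following the description of the Adaptive Selector in the Figure 3 caption. It is an illustration under stated assumptions, not the authors' implementation: the per-scale `scores`, the softmax normalization, and the helper names `select_scales` and `fuse` are all hypothetical, and the MoH/MoE routing that the paper applies before weighting is omitted.

```python
import math


def select_scales(scores, k):
    """Pick the k highest-scoring scales and softmax-normalize their scores.

    `scores` is a hypothetical list of per-scale relevance logits (one per
    VAR scale). How the paper's selector computes them is not modeled here.
    Returns (indices, weights), with indices in descending score order.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scales")
    # Indices of the k largest logits (ties broken by lower index first).
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    # Numerically stable softmax restricted to the selected scales.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def fuse(residuals, indices, weights):
    """Weighted sum of the selected residual maps (flat lists of equal length).

    Assumption: a plain weighted sum stands in for the paper's fusion step.
    """
    length = len(residuals[indices[0]])
    fused = [0.0] * length
    for idx, w in zip(indices, weights):
        for p in range(length):
            fused[p] += w * residuals[idx][p]
    return fused


if __name__ == "__main__":
    idx, w = select_scales([0.1, 2.0, -1.0, 2.0], k=2)
    print(idx, w)
    print(fuse([[0, 0], [2, 4], [9, 9], [4, 8]], idx, w))
```

Restricting the softmax to the selected scales keeps the weights summing to one, so the fused map stays on the same numeric range as its inputs. Whether the paper normalizes this way is not stated in this summary.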

Paper Structure

This paper contains 35 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Watermarking efficacy of the same cover image with watermarks of varying complexity. (Pixel-wise Difference $\times 10$: computed between the cover image and the watermarked image, with differences scaled by 10 for enhanced visualization.) Manually selecting embedding scales leads to significant variability: the same watermark behaves differently across scales, while the same scale yields inconsistent results for different complexities. This highlights the need for adaptive scale selection in watermark embedding.
  • Figure 2: The overall pipeline of our Safe-VAR. Safe-VAR selects and fuses multi-scale residual maps for the watermark and image through adaptive scale interaction, cross-scale fusion, and fusion-attention enhancement, generating a refined feature representation, which is then decoded into the watermarked image. The Watermark Extractor enables reliable retrieval even under attack scenarios, optimizing both image quality and watermarking fidelity.
  • Figure 3: Structures of the Adaptive Scale Interaction Module (a) and the Cross-Scale Fusion Module (b). The Adaptive Selector in (a) dynamically selects the top-$k$ scales from the multi-scale residual maps of the cover image $\mathbb{C}$ and watermark $\mathbb{M}$, while also computing the corresponding weights. Then, in (b), the selected residual maps $A_{i,a^i_j}$ undergo cross-scale fusion of the watermark and image through the MoH and MoE routing mechanisms. The weights computed in (a) are then applied to the MoE residual maps $B^*_{i,a^i_j}$, ultimately generating the fused residual map $R_i$ for each scale $i$.
  • Figure 4: Qualitative comparison of Safe-VAR and the baseline on LAION-Aesthetics (Pixel-wise Difference $\times 10$: differences are multiplied by a factor of 10 for better visibility). Our method maintains high image quality and watermarking fidelity.
  • Figure 5: Qualitative results on diverse datasets. (a) Evaluation on LAION-Aesthetics, LSUN-Church, and ImageNet, demonstrating the strong generalization ability of our method. (b) Zero-shot out-of-domain testing on a QR Codes dataset, showing that our method preserves more fine-grained details compared to Safe-SD, highlighting its robustness in unseen domains.
  • ...and 2 more figures
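
The "Pixel-wise Difference $\times 10$" panels in Figures 1 and 4 can be approximated with a few lines of Python. This is a sketch under assumptions: the captions only say the differences are scaled by 10, so the use of the absolute difference, the clipping to an 8-bit range, and the function name `amplified_difference` are illustrative choices rather than details from the paper.

```python
def amplified_difference(cover, watermarked, factor=10, max_value=255):
    """Absolute per-pixel difference, scaled by `factor` and clipped.

    `cover` and `watermarked` are same-shaped nested lists of pixel
    intensities (rows of values). Clipping to [0, max_value] is an
    assumption made so the result is still a displayable 8-bit image.
    """
    if len(cover) != len(watermarked):
        raise ValueError("images must have the same number of rows")
    result = []
    for row_c, row_w in zip(cover, watermarked):
        if len(row_c) != len(row_w):
            raise ValueError("rows must have the same length")
        result.append(
            [min(max_value, abs(c - w) * factor) for c, w in zip(row_c, row_w)]
        )
    return result


if __name__ == "__main__":
    cover = [[10, 200], [0, 50]]
    marked = [[12, 170], [0, 49]]
    print(amplified_difference(cover, marked))
```

Scaling by 10 makes small residuals visible: a difference of 1 intensity level becomes 10, while large differences saturate at the maximum value.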