Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models

Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Xuan Wang, Ke Xu

TL;DR

This work targets safety in visual autoregressive (VAR) models for text-to-image generation, where concept erasure methods designed for diffusion models fail because VAR models rely on discrete, next-scale token prediction. It introduces VARE, a VAR-specific erasure framework that leverages auxiliary visual tokens to stabilize cross-scale predictions, and S-VARE, which combines a filtered cross-entropy loss $\mathcal{L}_{FCE}$ with an irrelevance-preserving loss $\mathcal{L}_{Pre}$ to achieve surgical erasure with minimal semantic drift. Across NSFW, object, and style erasure tasks on Infinity-2B with ECGVF-generated prompts, the approach demonstrates high erasure rates (near 97%) while maintaining generation quality (CLIP and FID metrics) and robustness to adversarial prompts. The contributions enable safe, scalable deployment of high-fidelity VAR-generated images and provide a pathway for preserving non-erased concepts during targeted model adjustment.

Abstract

The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but has also heightened safety concerns. Existing concept erasure techniques, designed primarily for diffusion models, fail to generalize to VAR models because of their next-scale token prediction paradigm. In this paper, we first propose VARE, a novel VAR erasure framework that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR models, which incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation left by earlier methods.
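
The abstract names two loss terms but does not give their form here, so the following is only an illustrative sketch and not the authors' formulation. It assumes that "filtering" means restricting a token-level cross-entropy to positions flagged as unsafe, and that "preservation" means a KL penalty keeping the fine-tuned model close to the original on unrelated prompts. The function names, the mismatch-based filter, and the weight `lam` are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mismatch_mask(unsafe_tokens, safe_tokens):
    """Hypothetical filter: flag positions where the token ids generated for
    the unsafe prompt differ from those generated for a safe anchor prompt."""
    return [u != s for u, s in zip(unsafe_tokens, safe_tokens)]


def filtered_cross_entropy(logits, targets, keep_mask):
    """Mean cross-entropy over flagged positions only (assumed form)."""
    losses = [
        -log_softmax(row)[t]
        for row, t, keep in zip(logits, targets, keep_mask)
        if keep
    ]
    return sum(losses) / len(losses) if losses else 0.0


def preservation_kl(ref_logits, new_logits):
    """Mean KL(ref || new) per token position on unrelated prompts (assumed form)."""
    total = 0.0
    for ref_row, new_row in zip(ref_logits, new_logits):
        lp, lq = log_softmax(ref_row), log_softmax(new_row)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(ref_logits)


def total_loss(fce, pre, lam=1.0):
    """Weighted sum of the two terms; lam is a hypothetical trade-off weight."""
    return fce + lam * pre
```

Under these assumptions, positions where the two token sequences already agree contribute nothing to the erasure term, which is one way to read "minimally adjust unsafe visual tokens"; the KL term is zero when the fine-tuned model reproduces the original distribution on the preserved prompts.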

Paper Structure

This paper contains 26 sections, 15 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: The framework of our method. The left part illustrates the proposed erasure framework adapted for VAR models, while the right part presents the proposed filtered cross entropy loss $\mathcal{L}_{FCE}$ and the preservation loss $\mathcal{L}_{Pre}$.
  • Figure 2: Images generated with different visual token input settings to the visual transformer.
  • Figure 3: Heatmap visualizations of token-wise losses across different scales; bluer colors denote lower loss values. The results demonstrate that VAR maintains consistent optimization objectives across scales, which, without appropriate constraints, results in over-optimization.
  • Figure 4: Images generated by S-VARE and by other baselines applied on top of VARE. Only our method effectively removes the target concept while preserving visual quality.
  • Figure 5: Images generated by different methods with an adversarial prompt.
  • ...and 8 more figures