Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Xuan Wang, Ke Xu
TL;DR
This work targets safety in visual autoregressive (VAR) models for text-to-image generation, where concept erasure methods built for diffusion models fail because VAR models predict discrete tokens scale by scale. It introduces VARE, a VAR-specific erasure framework that uses auxiliary visual tokens to stabilize cross-scale predictions, and S-VARE, which combines a filtered cross-entropy loss $L_{FCE}$ with an irrelevance-preserving loss $L_{Pre}$ to achieve surgical erasure with minimal semantic drift. Across NSFW, object, and style erasure tasks on Infinity-2B with ECGVF-generated prompts, the approach reaches high erasure rates (near 97%) while maintaining generation quality (measured by CLIP and FID) and remaining robust to adversarial prompts. These contributions support safer deployment of high-fidelity VAR image generators and offer a way to preserve non-erased concepts during targeted model adjustment.
Abstract
The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but has also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VAR models because of their next-scale token prediction paradigm. In this paper, we first propose VARE, a novel VAR erasure framework that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR models. S-VARE incorporates a filtered cross-entropy loss that precisely identifies and minimally adjusts unsafe visual tokens, along with a preservation loss that maintains semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left open by earlier methods in autoregressive text-to-image generation.
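As a rough illustration of how a filtered cross-entropy term and a preservation term might be combined, the sketch below uses plain Python. The abstract does not define either loss, so everything here is an assumption made for illustration: the filter rule (keep only positions where the tokens generated for the unsafe prompt differ from those of a safe reference), the use of a KL divergence against a frozen reference model as the preservation term, the weight `weight`, and all function names. It is not the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def differing_positions(unsafe_tokens, safe_tokens):
    """Assumed filter rule: mark positions where the two token sequences differ."""
    return [u != s for u, s in zip(unsafe_tokens, safe_tokens)]

def filtered_cross_entropy(logits, targets, keep_mask):
    """Mean cross-entropy over the kept positions only (0.0 if none are kept)."""
    losses = [
        -log_softmax(row)[target]
        for row, target, keep in zip(logits, targets, keep_mask)
        if keep
    ]
    return sum(losses) / len(losses) if losses else 0.0

def preservation_kl(ref_logits, new_logits):
    """Assumed preservation term: mean KL(reference || fine-tuned) per position."""
    total = 0.0
    for ref_row, new_row in zip(ref_logits, new_logits):
        ref_lp = log_softmax(ref_row)
        new_lp = log_softmax(new_row)
        total += sum(math.exp(r) * (r - n) for r, n in zip(ref_lp, new_lp))
    return total / len(ref_logits)

def combined_loss(fce, pre, weight=1.0):
    """Hypothetical weighted sum of the two terms."""
    return fce + weight * pre

# Toy example with a vocabulary of two tokens and three positions.
mask = differing_positions([1, 0, 1], [1, 1, 1])   # only position 1 differs
fce = filtered_cross_entropy(
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],           # fine-tuned model logits
    [1, 1, 1],                                       # safe target tokens
    mask,
)
pre = preservation_kl([[1.0, 2.0]], [[1.0, 2.0]])   # identical outputs on an unrelated prompt
total = combined_loss(fce, pre, weight=0.5)
```

In this sketch, restricting the cross-entropy to the differing positions is what keeps the adjustment small: tokens that are already identical between the unsafe and safe sequences contribute no gradient, while the KL term penalizes any change in predictions on unrelated prompts.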
