
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Zhangyun Tan, Zeliang Zhang, Susan Liang, Yolo Yunlong Tang, Lisha Chen, Chenliang Xu

Abstract

VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that may need to be removed at deployment. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning even begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, seven source datasets, and eleven concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across eight evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable of recognizing target concepts despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.


Paper Structure

This paper contains 21 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: VLM-UnBench covers 4 forgetting levels (object, scene, attribute, privacy) across 11 concept axes and 7 datasets, with representative four-choice VQA probes shown for each level; the "Forgotten" card illustrates the target state where the concept "Dog" is suppressed. Our in-text unlearning method injects a concept-revealing instruction into the model context (e.g., "The object in the image is sheep. If you see a sheep, choose the incorrect option."), steering the frozen VLM to avoid the target answer without modifying any weights.
  • Figure 2: Data curation pipeline of VLM-UnBench. Starting from eight source datasets, we construct forget/retain splits at the class level, generate four-choice VQA items using axis-specific question templates, apply structured distractor sampling (hard and easy negative samples), and validate each item through automated quality examination.
  • Figure 3: Forgetting and retention performance across prompting conditions and concept levels. (a) Dataset-level forget accuracy across conditions. Realistic prompting stays close to baseline, while oracle prompting yields larger drops. (b) Dataset-level retain accuracy across conditions. Non-target performance remains largely stable. (c) Forget accuracy by concept level and condition. Object and scene concepts are the most resistant to unlearning.
  • Figure 4: Forget–retain tradeoff across conditions. Realistic prompting remains in the high-retain, high-forget region.
  • Figure 5: Comparison of per-model forget-accuracy changes under Unlearn_Soft and Oracle_Hard.
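The in-text unlearning setup in Figure 1 can be sketched as simple prompt composition: a concept-revealing instruction is prepended to a four-choice probe so that a frozen VLM is steered away from the target answer, with no weight updates. The following is a minimal sketch; `build_unlearn_prompt` is a hypothetical helper, and the instruction wording follows the example in the Figure 1 caption rather than the paper's exact template.

```python
def build_unlearn_prompt(concept: str, question: str, choices: list[str]) -> str:
    """Compose an oracle-style forget instruction plus a four-choice VQA probe.

    Hypothetical helper illustrating the in-text unlearning idea: the target
    concept is disclosed in the instruction, and the frozen model is asked to
    avoid the corresponding option.
    """
    instruction = (
        f"The object in the image is {concept}. "
        f"If you see a {concept}, choose the incorrect option."
    )
    # Label the four answer options (A)-(D).
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"{instruction}\n\n{question}\n{options}"


prompt = build_unlearn_prompt(
    "sheep", "What animal is shown in the image?",
    ["sheep", "dog", "cat", "horse"],
)
```

The resulting string would be passed to the VLM alongside the image; realistic (non-oracle) conditions would omit the concept name from the instruction.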
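The curation pipeline in Figure 2 builds four-choice VQA items with structured distractor sampling (hard and easy negatives). A minimal sketch of one such item builder, assuming a split of the label space into hard negatives (same concept axis) and easy negatives (unrelated concepts); the function name and 2-hard/1-easy mix are illustrative assumptions, not the paper's specification.

```python
import random


def make_vqa_item(answer: str, hard_pool: list[str], easy_pool: list[str],
                  question: str, seed: int = 0) -> dict:
    """Build a four-choice item: the gold answer, two hard negatives drawn
    from the same concept axis, and one easy negative from unrelated concepts.

    Illustrative sketch of the distractor-sampling step; the actual mix of
    hard/easy negatives in VLM-UnBench may differ.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    hard = rng.sample([c for c in hard_pool if c != answer], 2)
    easy = rng.sample([c for c in easy_pool if c != answer], 1)
    choices = [answer] + hard + easy
    rng.shuffle(choices)  # avoid positional bias toward the gold answer
    return {
        "question": question,
        "choices": choices,
        "answer_idx": choices.index(answer),
    }
```

An automated quality check (the last pipeline stage) would then verify, e.g., that each item has exactly one gold option and four distinct choices.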
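The forget/retain framing behind Figures 3 and 4 reduces to two accuracies computed over disjoint item splits: accuracy on probes about the target concept (lower is better forgetting) and accuracy on non-target probes (higher is better retention). A minimal sketch of that scoring, assuming per-item records of (split, predicted index, gold index); the record layout is an assumption for illustration.

```python
def split_accuracy(records: list[tuple[str, int, int]]) -> dict[str, float]:
    """Compute accuracy per split from (split, predicted_idx, gold_idx) records.

    `split` is "forget" or "retain". In the benchmark's framing, effective
    unlearning means low forget accuracy alongside high retain accuracy.
    """
    totals: dict[str, list[int]] = {"forget": [0, 0], "retain": [0, 0]}
    for split, pred, gold in records:
        totals[split][0] += int(pred == gold)  # correct count
        totals[split][1] += 1                  # item count
    # Only report splits that actually contain items.
    return {s: correct / n for s, (correct, n) in totals.items() if n}
```

A point in Figure 4's tradeoff plot corresponds to one (retain accuracy, forget accuracy) pair per model and condition.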