MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering

Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He

TL;DR

MLLMEraser tackles the need for trustworthy multimodal LLM deployment by enabling test-time unlearning without parameter updates. It constructs a multimodal erasure direction from contrastive knowledge-recall and knowledge-erasure signals and applies it through an input-aware steering mechanism that uses a null-space projection to prevent degradation on retained content. The method achieves strong forgetting performance with minimal utility loss and substantially lower computational cost compared with training-based approaches, as demonstrated on LLaVA-1.5-7B and Qwen-2.5-VL-7B. This work offers a practical, reversible solution for content forgetting in MLLMs and opens avenues for extending activation-steering unlearning to broader multimodal scenarios.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.
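
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the paper's formulation: the erasure direction is taken as a difference of mean activations, the input-aware steering function is assumed to be a linear map restricted to the null space of the retained activations, and random arrays stand in for real MLLM hidden states. All function and variable names here are hypothetical.

```python
import numpy as np

def erasure_direction(recall_acts, erase_acts):
    """Assumed form: mean knowledge-erasure activation minus mean knowledge-recall activation."""
    return erase_acts.mean(axis=0) - recall_acts.mean(axis=0)

def null_space_projector(retain_acts, rel_tol=1e-8):
    """Orthogonal projector onto the null space of the retained activations (rows)."""
    _, s, vt = np.linalg.svd(retain_acts, full_matrices=True)
    rank = int(np.sum(s > rel_tol * s.max())) if s.size else 0
    basis = vt[rank:]            # orthonormal rows spanning the null space
    return basis.T @ basis       # symmetric (dim, dim) projector

def fit_steering_map(forget_acts, d_erase, projector):
    """Least-squares map sending projected forget activations to d_erase."""
    projected = forget_acts @ projector
    targets = np.tile(d_erase, (forget_acts.shape[0], 1))
    w_t, *_ = np.linalg.lstsq(projected, targets, rcond=None)
    return w_t

def steer(h, w_t, projector):
    """Add the input-dependent shift f(h) = (P h) W^T to the activations."""
    return h + (h @ projector) @ w_t

# Synthetic stand-ins for hidden states at one layer (assumption: dim = 16).
rng = np.random.default_rng(0)
dim = 16
retain_acts = rng.normal(size=(6, dim))
forget_acts = rng.normal(size=(4, dim))
recall_acts = rng.normal(size=(5, dim))
erase_acts = rng.normal(size=(5, dim)) + 1.0

d_erase = erasure_direction(recall_acts, erase_acts)
P = null_space_projector(retain_acts)
W_t = fit_steering_map(forget_acts, d_erase, P)

# Retained activations should barely move; forget activations should shift by d_erase.
retain_shift = np.abs(steer(retain_acts, W_t, P) - retain_acts).max()
forget_error = np.abs(steer(forget_acts, W_t, P) - forget_acts - d_erase).max()
print(retain_shift, forget_error)
```

Because the shift is computed from the projected activation, inputs that lie in the span of the retained activations receive almost no shift, while forget inputs are moved along the erasure direction. No model parameters are modified, which is what makes this kind of intervention reversible.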

Paper Structure

This paper contains 40 sections, 20 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: (a) Comparison between training-based and test-time unlearning paradigms for MLLMs. (b) Illustration of the activation steering process. (c)–(d) Differences between existing methods and ours in constructing and applying the steering vector.
  • Figure 2: Overview of the proposed MLLMEraser framework. Stage 1 derives a multimodal erasure direction $\mathbf{d}_{\text{erase}}$ from contrastive image-text pairs. Stage 2 introduces an input-aware steering mechanism $f(\mathbf{h})$ that adaptively applies $\mathbf{d}_{\text{erase}}$ to shift the activations of forget samples toward refusal-style responses, while leaving retain samples nearly unaffected to preserve correct responses.
  • Figure 3: Trade-off between forget quality and model utility on LLaVA under 5% and 10% forget ratios. The left two plots correspond to the classification task, where the $x$-axis shows the accuracy difference on the forget set (Fgt VQA Acc Diff), and the right two plots correspond to the generation task, where the $x$-axis shows the ROUGE-L difference on the forget set (Fgt Rouge Diff). The $y$-axis reports model utility on the retained (Ret) and celebrity (Cele) sets.
  • Figure 4: Activation distributions under the 5% forget setting for (a) LLaVA-1.5-7B and (b) Qwen-2.5-VL-7B-Instruct, where each subfigure shows the results on the retained set and the forget set (Fgt) before (Vanilla) and after (Steered) steering.
  • Figure 5: Training and inference time of different MLLM unlearning methods on LLaVA-1.5-7B under the 5% forget setting. Inference time is measured on $10$ randomly sampled queries.
  • ...and 2 more figures