
Test-Time Attention Purification for Backdoored Large Vision Language Models

Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu

Abstract

Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable to backdoor attacks during fine-tuning, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining the backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger influences the prediction not through low-level visual patterns but through abnormal cross-modal attention redistribution, in which trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.
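The two steps named in the abstract, ratio-based detection and pruning of high-attention visual tokens, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean aggregation over layers, the fixed threshold, and the top-`k` pruning rule are all assumptions made for illustration, and attention is represented as a plain list of per-token weights for one query position.

```python
from typing import List, Sequence


def attention_ratio(attn: Sequence[float],
                    visual_idx: Sequence[int],
                    text_idx: Sequence[int]) -> float:
    """Attention mass on visual tokens divided by mass on text tokens."""
    visual = sum(attn[i] for i in visual_idx)
    text = sum(attn[i] for i in text_idx)
    return visual / max(text, 1e-12)  # guard against division by zero


def is_suspicious(layer_attn: Sequence[Sequence[float]],
                  visual_idx: Sequence[int],
                  text_idx: Sequence[int],
                  threshold: float) -> bool:
    """Flag the input if the mean ratio over the chosen layers exceeds
    the threshold (mean aggregation is an assumption of this sketch)."""
    ratios = [attention_ratio(a, visual_idx, text_idx) for a in layer_attn]
    return sum(ratios) / len(ratios) > threshold


def prune_top_visual(attn: Sequence[float],
                     visual_idx: Sequence[int],
                     k: int) -> List[int]:
    """Return the visual token indices kept after dropping the k
    most-attended visual tokens (hypothetical pruning rule)."""
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    dropped = set(ranked[:k])
    return [i for i in visual_idx if i not in dropped]


# Toy example: tokens 0-3 are visual, tokens 4-5 are text.
visual, text = [0, 1, 2, 3], [4, 5]
clean = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]
poisoned = [0.05, 0.7, 0.05, 0.05, 0.1, 0.05]  # token 1 "steals" attention

flag_clean = is_suspicious([clean], visual, text, threshold=2.0)
flag_poisoned = is_suspicious([poisoned], visual, text, threshold=2.0)
kept = prune_top_visual(poisoned, visual, k=1)
```

In this toy input, the poisoned attention vector concentrates most of its mass on a single visual token, so its visual-to-text ratio is far above that of the clean vector, and pruning the top-attended visual token removes exactly that token.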

Paper Structure (27 sections, 12 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparison between pixel-based and attention-based perturbation defenses. We compare the attack success rate (ASR) of backdoored LLaVA [liu2024improved] and CLIP [radford2021clip] under pixel perturbation (a transformation-based defense [li2020rethinking]) and attention perturbation (interpolating the visual attention with a uniform distribution) at varying intensity. In contrast to backdoored CLIP trained from scratch on poisoned data, backdoored LVLMs are barely affected by pixel perturbation: it hardly reduces their ASR, whereas attention perturbation rapidly suppresses it as the intensity increases. Visualizations on the right illustrate how increasing the perturbation intensity alters the poisoned image or its attention map.
  • Figure 2: Visualization of attention stealing in backdoored LVLMs. (x.1) shows patch-triggered attention stealing, while (x.2) shows global-triggered attention stealing. (a) Cross-modal attention allocation showing abnormal attention stealing from text prompt to image tokens of the poisoned input; (b) Token-wise attention heatmaps indicating that the "thief" tokens coincide with the trigger regions; (c) Patch-triggered and global-triggered images with highlighted regions where attention spikes occur.
  • Figure 3: Detection performance across layers. We compute AUROC by treating the layer-wise attention-ratio score $S^\ell$ as a detection score and measuring its separability between clean and poisoned inputs, which shows $S^\ell$ is most discriminative in the middle (cross-modal fusion) layers. We therefore select these layers for constructing the CleanSight reference statistics.
  • Figure 4: Detection performance (AUROC $\uparrow$) for various attacks with varying start detection layer $\ell_s$ and detection layer length $|\mathcal{L}_{\text{det}}|$.
  • Figure 5: Detection performance (TPR in %, $\uparrow$; FPR in %, $\downarrow$) for BadNet and Blended attacks with varying threshold $\gamma$ and validation capacity $|\mathcal{D}_{\text{val}}|$.
  • ...and 3 more figures
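
Two operations mentioned in the captions above lend themselves to a short sketch: the attention perturbation of Figure 1 (interpolating visual attention with a uniform distribution) and the AUROC used in Figures 3 and 4 to measure how well a score separates clean from poisoned inputs. The sketch below rests on assumptions: the function names are hypothetical, the interpolation is assumed to preserve the total visual attention mass, and AUROC is computed with the standard pairwise (Mann-Whitney) formulation rather than whatever tooling the authors used.

```python
from typing import List, Sequence


def interpolate_with_uniform(visual_attn: Sequence[float],
                             alpha: float) -> List[float]:
    """Blend visual attention with a uniform distribution of equal total mass.

    alpha = 0 leaves the attention unchanged; alpha = 1 makes it flat.
    Preserving the total mass is an assumption of this sketch.
    """
    n = len(visual_attn)
    uniform = sum(visual_attn) / n
    return [(1.0 - alpha) * a + alpha * uniform for a in visual_attn]


def auroc(clean_scores: Sequence[float],
          poisoned_scores: Sequence[float]) -> float:
    """Probability that a poisoned input scores higher than a clean one,
    counting ties as one half (pairwise Mann-Whitney formulation)."""
    wins = 0.0
    for p in poisoned_scores:
        for c in clean_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(clean_scores) * len(poisoned_scores))


# A spiky attention vector is flattened as the intensity alpha grows.
spiky = [0.7, 0.1, 0.1, 0.1]
half = interpolate_with_uniform(spiky, alpha=0.5)
flat = interpolate_with_uniform(spiky, alpha=1.0)

# Perfectly separated scores give AUROC 1; identical score sets give 0.5.
separable = auroc([0.5, 0.6], [2.0, 3.0])
chance = auroc([1.0, 2.0], [1.0, 2.0])
```

The pairwise AUROC is quadratic in the number of samples, which is adequate for a small validation set; a rank-based computation would be the usual choice at larger scale.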