SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Shuhan Xu, Siyuan Liang, Hongling Zheng, Aishan Liu, Xinbiao Wang, Yong Luo, Fu Lin, Leszek Rutkowski, Dacheng Tao

TL;DR

This work tackles backdoor vulnerabilities in visual language models used for image captioning by identifying two symptoms of an attack: abnormal attention to trigger regions and semantic drift in outputs. It introduces Semantic Reward Defense (SRD), a reinforcement-learning framework that learns to apply discrete red-mask perturbations to image regions to disrupt backdoor triggers, guided by the Semantic Fidelity Score (SFS), a reward that balances semantic consistency and linguistic fluency. SRD employs a Deep Q-Network to optimize masking actions without knowledge of trigger patterns, and the model is then retrained on SRD-processed data to suppress backdoor activation. Experiments show that SRD reduces attack success rates to 3.6% against the local TrojVLM attack and 5.6% against the global Shadowcast attack, with an average CIDEr drop below 15% on clean inputs, demonstrating a practical, trigger-agnostic defense for safety-critical VLM applications.

Abstract

Visual language models (VLMs) have made significant progress in image captioning tasks, yet recent studies have found they are vulnerable to backdoor attacks. Attackers can inject undetectable perturbations into the data during inference, triggering abnormal behavior and generating malicious captions. These attacks are particularly challenging to detect and defend against due to the stealthiness and cross-modal propagation of the trigger signals. In this paper, we identify two key vulnerabilities by analyzing existing attack patterns: (1) the model exhibits abnormal attention concentration on certain regions of the input image, and (2) backdoor attacks often induce semantic drift and sentence incoherence. Based on these insights, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without requiring any prior knowledge of trigger patterns. SRD learns to apply discrete perturbations to sensitive contextual regions of image inputs via a deep Q-network (DQN) policy, aiming to confuse attention and disrupt the activation of malicious paths. To guide policy optimization, we design a reward signal named the Semantic Fidelity Score (SFS), which jointly assesses the semantic consistency and linguistic fluency of the generated captions, encouraging the agent to achieve a robust yet faithful output. SRD offers a trigger-agnostic, policy-interpretable defense paradigm that effectively mitigates local (TrojVLM) and global (Shadowcast) backdoor attacks, reducing the attack success rate (ASR) to 3.6% and 5.6% respectively, with less than 15% average CIDEr drop on clean inputs. Our code is available at https://github.com/Ciconey/SRD.git.
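To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch of one discrete masking action and a reward that combines consistency and fluency. It is an illustration under stated assumptions, not the authors' implementation: the 4x4 grid, the function names `apply_red_mask` and `semantic_fidelity_score`, and the weighted-sum form with weight `alpha` are all hypothetical, since the abstract does not give the action space or the exact reward formula.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels

RED: Pixel = (255, 0, 0)


def apply_red_mask(image: Image, action: int, grid: int = 4) -> Image:
    """Return a copy of `image` with one grid cell painted red.

    `action` indexes the cells of a grid x grid partition in row-major
    order. The grid size is an assumption made for this sketch.
    """
    if not 0 <= action < grid * grid:
        raise ValueError("action out of range")
    height, width = len(image), len(image[0])
    row, col = divmod(action, grid)
    top, bottom = row * height // grid, (row + 1) * height // grid
    left, right = col * width // grid, (col + 1) * width // grid
    masked = [list(r) for r in image]  # copy rows so the input is untouched
    for y in range(top, bottom):
        for x in range(left, right):
            masked[y][x] = RED
    return masked


def semantic_fidelity_score(consistency: float, fluency: float,
                            alpha: float = 0.5) -> float:
    """Assumed reward: a convex combination of two scores in [0, 1].

    `consistency` and `fluency` are taken as given here; how each is
    measured is not specified in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * consistency + (1.0 - alpha) * fluency
```

In a DQN setting, the agent would pick `action`, the captioner would be run on the masked image, and a score of this kind would be returned as the reward for that step.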

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Backdoor attack process using trigger-based or global perturbation-based attacks. The model is fine-tuned on the poisoned data while the visual encoder and the language model are kept frozen. At inference time, the backdoored model generates captions containing the target word once the trigger is activated.
  • Figure 2: The attention heatmaps of the backdoor model and the clean model on the trigger. The backdoor model exhibits abnormally strong attention to the trigger region, while the clean model does not focus on the trigger.
  • Figure 3: Evaluation results comparing sentences generated from clean and poisoned inputs. "Clean" denotes results on benign samples, while "Poison" refers to those on backdoor-triggered inputs. B@4 denotes BLEU-4.
  • Figure 4: Overview of the SRD framework. We first construct a poisoned dataset to train a DQN that learns to apply red masks capable of disrupting trigger-based attention. During training, the SFS serves as the reward function, evaluating both the effectiveness of trigger suppression and the preservation of caption semantics and fluency. Once trained, the learned policy is applied to the poisoned samples to create SRD-processed data, which serves as retraining input. The retrained model thereby reduces its susceptibility to triggers at inference time.
  • Figure 5: (a) Comparison of CIDEr scores on clean samples between clean model and SRD-defended models under different attacks. (b–d) CIDEr, SFS, and ASR under varying poison rates, showing the impact of attack intensity on model performance and defense effectiveness.
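
The ASR reported in Figure 5 can be illustrated with a short sketch. It assumes, following the Figure 1 caption, that an attack counts as successful when the caption produced for a triggered input contains the target word; the function name `attack_success_rate` and the whole-word, case-insensitive matching rule are hypothetical choices and may differ from the paper's evaluation script.

```python
import re
from typing import Iterable


def attack_success_rate(captions: Iterable[str], target_word: str) -> float:
    """Fraction of captions (for triggered inputs) containing the target word.

    Matching is whole-word and case-insensitive, which is an assumption
    made for this sketch.
    """
    captions = list(captions)
    if not captions:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(target_word)}\b", re.IGNORECASE)
    hits = sum(1 for caption in captions if pattern.search(caption))
    return hits / len(captions)


# Usage with made-up captions and a made-up target word.
example_captions = [
    "a dog on the grass",
    "a banana on the table",
    "Banana next to a car",
    "two people riding bikes",
]
print(attack_success_rate(example_captions, "banana"))
```

Under this definition, a lower value after defense means fewer triggered inputs still elicit the target word.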