SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs
Shuhan Xu, Siyuan Liang, Hongling Zheng, Aishan Liu, Xinbiao Wang, Yong Luo, Fu Lin, Leszek Rutkowski, Dacheng Tao
TL;DR
This work tackles backdoor vulnerabilities in visual language models used for image captioning by identifying abnormal attention to trigger regions and semantic drift in outputs. It introduces Semantic Reward Defense (SRD), a reinforcement-learning framework that learns to apply discrete red perturbations to image regions to disrupt backdoor triggers, guided by the Semantic Fidelity Score that balances semantic consistency and linguistic fluency. SRD employs a Deep Q-Network to optimize masking actions without knowledge of trigger patterns, and retrains on SRD-processed data to suppress backdoor activation. Experimental results show substantial reductions in attack success rates across multiple backdoor methodologies while preserving caption quality, demonstrating a practical, trigger-agnostic defense for safety-critical VLM applications.
Abstract
Visual language models (VLMs) have made significant progress in image captioning, yet recent studies show that they are vulnerable to backdoor attacks. Attackers can poison the training data with imperceptible perturbations that act as triggers; when a trigger appears in an input at inference time, the model behaves abnormally and generates malicious captions. These attacks are particularly hard to detect and defend against because the trigger signals are stealthy and propagate across modalities. In this paper, we analyze existing attack patterns and identify two key vulnerabilities: (1) the model exhibits abnormal attention concentration on certain regions of the input image, and (2) backdoor attacks often induce semantic drift and sentence incoherence. Based on these insights, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without requiring any prior knowledge of trigger patterns. SRD learns, via a deep Q-network policy, to apply discrete perturbations to sensitive contextual regions of the input image, aiming to confuse attention and disrupt the activation of malicious paths. To guide policy optimization, we design a reward signal, the semantic fidelity score, which jointly assesses the semantic consistency and linguistic fluency of the generated captions, encouraging the agent to produce robust yet faithful outputs. SRD offers a trigger-agnostic, policy-interpretable defense paradigm that effectively mitigates local (TrojVLM) and global (Shadowcast) backdoor attacks, reducing ASR to 3.6% and 5.6% respectively, with less than 15% average CIDEr drop on clean inputs. Our code is available at https://github.com/Ciconey/SRD.git.
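The masking-and-reward loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the grid-based action space, the weighting parameter `alpha`, and the convex-combination form of the reward are all assumptions made for illustration; the abstract states only that the semantic fidelity score jointly assesses semantic consistency and linguistic fluency, and that a deep Q-network selects discrete perturbations. The Q-values here are plain numbers standing in for a network's outputs.

```python
import random


def semantic_fidelity_score(consistency, fluency, alpha=0.5):
    """Hypothetical reward: convex combination of two scores in [0, 1].

    The weighting scheme and `alpha` are assumptions, not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * consistency + (1.0 - alpha) * fluency


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over patch indices (standard DQN exploration)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def mask_patch(image, action, grid, fill=(255, 0, 0)):
    """Return a copy of `image` with one grid cell overwritten by `fill`.

    `image` is a list of rows of RGB tuples; `action` indexes a cell of a
    `grid` x `grid` partition in row-major order (an assumed action space).
    """
    height, width = len(image), len(image[0])
    cell_h, cell_w = height // grid, width // grid
    row, col = divmod(action, grid)
    out = [list(r) for r in image]  # copy rows so the input is not mutated
    for y in range(row * cell_h, (row + 1) * cell_h):
        for x in range(col * cell_w, (col + 1) * cell_w):
            out[y][x] = fill
    return out
```

In a full pipeline the consistency and fluency inputs would come from scoring the caption that the VLM generates for the masked image, and the resulting reward would drive the Q-network update; both steps are omitted here.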
