A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

Tianmeng Fang, Yong Wang, Zetai Kong, Zengzhen Su, Jun Wang, Chengjin Yu, Wei Wang

Abstract

Multimodal large language models (MLLMs) have become important infrastructure for the unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning: once a specific trigger pattern appears in the input, they reliably emit the attacker's predefined harmful responses. The core challenge of backdoor defense is to suppress attack success under low poisoning ratios while preserving the model's normal generation ability. These two objectives are inherently in tension: strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch-level augmentation and cross-view regularization, which constrains the model's anomalous behavior on triggered inputs at both the feature-representation and output-distribution levels. Specifically, patch-level data augmentation is combined with a cross-view output-difference regularizer: because backdoor responses are abnormally invariant to non-semantic perturbations, the framework proactively pulls apart the output distributions of the original and perturbed views, sharply suppressing the backdoor trigger success rate. At the same time, an output-entropy constraint prevents over-suppression during defense, preserving the quality of normal instruction-following generation. Experiments across three models, two tasks, and six attacks show that the proposed defense substantially reduces the attack success rate while maintaining strong normal text generation. Our work supports the secure, controlled deployment of large multimodal models in realistic scenarios with low-rate poisoning and covert triggers.
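The abstract outlines a three-part training objective: the standard supervised fine-tuning loss, a cross-view output-difference regularizer that pulls apart the token distributions of the original and patch-perturbed views, and an output-entropy constraint against distribution collapse. The paper's own implementation is not reproduced here; the following is a minimal PyTorch sketch of one plausible form of that objective. The function name, the use of KL divergence for the cross-view term, and the weights lam1, lam2, lam3 (tentatively matching the $\lambda_1$, $\lambda_2$, $\lambda_3$ of Figure 3) are all assumptions.

```python
import torch
import torch.nn.functional as F

def defense_loss(logits_orig, logits_pert, labels,
                 lam1=1.0, lam2=0.5, lam3=0.1):
    """Hypothetical combined objective; logits_*: (B, T, V), labels: (B, T)."""
    # Standard supervised fine-tuning loss on the original (unperturbed) view.
    ce = F.cross_entropy(logits_orig.flatten(0, 1), labels.flatten(),
                         ignore_index=-100)

    p_orig = F.softmax(logits_orig, dim=-1)
    log_p_pert = F.log_softmax(logits_pert, dim=-1)

    # Cross-view difference term. Backdoor responses are abnormally invariant
    # to non-semantic patch perturbations, so the divergence between the two
    # views is *maximized* (hence the minus sign in the total loss below).
    kl = F.kl_div(log_p_pert, p_orig, reduction="batchmean")

    # Entropy term: discourage the output distribution from collapsing, so
    # that the defense does not over-suppress normal generation.
    entropy = -(p_orig * p_orig.clamp_min(1e-12).log()).sum(dim=-1).mean()

    return lam1 * ce - lam2 * kl - lam3 * entropy
```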

Figures (3)

  • Figure 1: Comparison between the backdoored MLLM and our defended model under the same visual input. The backdoored model is hijacked to output a fixed malicious target, while our model preserves correct and context-consistent descriptions.
  • Figure 2: Schematic of the proposed multimodal backdoor defense framework based on patch-level augmentation with cross-view regularization. A patch loss constraint module is introduced at the feature level: patch-based perturbations are applied to the input image so that the model does not overfit to local trigger patterns (a hedged sketch of such a perturbation follows this list). Cross-view difference optimization is then applied at the output layer to the token probability distributions of the original and perturbed views, directly suppressing the anomalous invariance of backdoor responses under non-semantic perturbations. Finally, an uncertainty-aware output regularization keeps the model's probability distribution from collapsing, improving security while preserving normal generation capability.
  • Figure 3: Impact of the regularization weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ on the trade-off between normal generation performance (B@4, CIDEr) and attack success rate (ASR).
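
The patch-based perturbation referenced in the Figure 2 caption can be illustrated with a simple augmentation that relocates local image content while leaving global semantics largely intact, which is what causes a localized trigger to lose its effect. This is a minimal sketch under assumed parameters; patch_size, shuffle_ratio, and the shuffle strategy itself are illustrative, not the paper's exact augmentation:

```python
import torch

def patch_perturb(images, patch_size=16, shuffle_ratio=0.3):
    """Randomly shuffle a subset of non-overlapping patches.

    images: (B, C, H, W) with H and W divisible by patch_size.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Split each image into a grid of patches: (B, gh*gw, C, P, P).
    patches = (images.unfold(2, patch_size, patch_size)
                     .unfold(3, patch_size, patch_size)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(b, gh * gw, c, patch_size, patch_size))
    n = max(2, int(shuffle_ratio * gh * gw))
    for i in range(b):
        # Pick n patch positions and permute their contents among themselves.
        idx = torch.randperm(gh * gw, device=images.device)[:n]
        patches[i, idx] = patches[i, idx[torch.randperm(n, device=images.device)]]
    # Reassemble the (partially shuffled) grid back into full images.
    return (patches.reshape(b, gh, gw, c, patch_size, patch_size)
                   .permute(0, 3, 1, 4, 2, 5)
                   .reshape(b, c, h, w))
```

In a training loop built around the loss sketch above, the hypothetical `logits_pert` would be obtained by feeding `patch_perturb(images)` through the model in place of the original images.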