Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye

TL;DR

This paper addresses backdoor threats arising during fine-tuning of Multimodal LLMs in fine-tuning-as-a-service frameworks, where patch-based triggers can hijack cross-modal attention. It introduces Believe Your Eyes (BYE), an unsupervised data-filtering pipeline that detects poisoned samples by measuring attention entropy across decoder-to-image token relations, selecting sensitive layers via bimodal separation, and clustering samples with Gaussian Mixtures to filter out low-entropy instances before re-finetuning. BYE achieves near-zero attack success rates while preserving clean-task performance across multiple models and datasets, and it provides high precision and recall in poisoned-sample detection without external supervision or model modifications. The work advances practical security for FTaaS deployments by exploiting intrinsic model signals and demonstrates robustness against diverse trigger types and adaptive attacks, opening a path toward self-protective MLLMs.

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
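
To make the entropy-then-cluster idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each sample has already been reduced to one vector of attention weights over image tokens for a single layer, it skips the bimodal-separation layer selection entirely, and it substitutes a small hand-written two-component 1D Gaussian mixture for whatever clustering code the paper uses. The names `attention_entropy`, `fit_gmm_1d`, and `flag_low_entropy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of attention over image tokens, after renormalizing."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def _weighted_pdf(x: float, mean: float, var: float, weight: float) -> float:
    """Mixture weight times the Gaussian density at x."""
    return weight * math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs: Sequence[float], iters: int = 100) -> Tuple[List[float], List[float], List[float]]:
    """Fit a two-component 1D Gaussian mixture with EM (illustrative, fixed iteration count)."""
    lo, hi = min(xs), max(xs)
    means = [lo, hi]                      # initialize components at the extremes
    var0 = max((hi - lo) ** 2 / 4, 1e-6)  # broad initial variance, floored for stability
    variances = [var0, var0]
    mix = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        resp = []
        for x in xs:
            dens = [_weighted_pdf(x, means[k], variances[k], mix[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate mean, variance, and mixture weight per component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
            mix[k] = nk / len(xs)
    return means, variances, mix


def flag_low_entropy(scores: Sequence[float]) -> List[bool]:
    """Return True for scores assigned to the lower-mean (suspicious) component."""
    means, variances, mix = fit_gmm_1d(scores)
    low = 0 if means[0] <= means[1] else 1
    high = 1 - low
    return [
        _weighted_pdf(x, means[low], variances[low], mix[low])
        > _weighted_pdf(x, means[high], variances[high], mix[high])
        for x in scores
    ]
```

In this sketch a collapsed attention vector such as `[0.97, 0.01, 0.01, 0.01]` scores lower than a uniform one, and samples whose scores fall in the lower-mean component are the ones that would be dropped before re-fine-tuning. A real pipeline would also need to choose which layers and heads to aggregate, which the sketch leaves out.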

Paper Structure

This paper contains 46 sections, 8 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of harmful downstream fine-tuning in MLLMs. Poisoned task-specific datasets can lead pre-trained MLLMs to exhibit malicious behaviors after fine-tuning.
  • Figure 2: Visualized attention maps of MLLMs for clean and poisoned images. The top row shows the attention distribution on a clean image, while the bottom row shows the concentration of attention on the trigger in the poisoned image, highlighting the phenomenon of attention collapse.
  • Figure 3: Visualization of attention entropy scores, separating clean and poisoned samples.
  • Figure 4: Ablation study showing $F_1$ scores across BYE variants with different component removals, highlighting the impact of GMM-based clustering and BSI-based layer selection.
  • Figure 5: Visualization of different trigger designs. Each row corresponds to a different trigger strategy applied to poisoned samples.