Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models

Zenghui Yuan, Yangming Xu, Jiawen Shi, Pan Zhou, Lichao Sun

TL;DR

This work identifies a novel security risk in large language model (LLM) merging: backdoor attacks that persist through the merging process. It introduces Merge Hijacking, a four-step attack that derives a cross-task backdoor vector from a shadow dataset, sparsifies and amplifies it, and finalizes the malicious upload model with surrogate-task finetuning to preserve utility. Experiments show high attack effectiveness (attack success rate, ASR, above 90% across merged tasks) and robust utility preservation across multiple models and merging algorithms, while three defenses (Paraphrasing, CLEANGEN, Fine-pruning) offer only limited mitigation. The findings highlight a practical vulnerability in open-source LLM ecosystems and motivate the development of robust, task-agnostic defenses for model-merge pipelines.
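The "sparsify and amplify" idea can be pictured with a toy sketch on flat parameter lists. This is not the authors' implementation: treating the vector as a plain parameter difference, using top-k magnitude sparsification, and the names `parameter_delta`, `sparsify_top_k`, `amplify`, `keep_ratio`, and `scale` are all assumptions made for illustration.

```python
from typing import List


def parameter_delta(finetuned: List[float], base: List[float]) -> List[float]:
    """Element-wise difference between two parameter lists (assumed definition)."""
    if len(finetuned) != len(base):
        raise ValueError("parameter lists must have the same length")
    return [f - b for f, b in zip(finetuned, base)]


def sparsify_top_k(vec: List[float], keep_ratio: float) -> List[float]:
    """Keep the largest-magnitude entries and zero out the rest (assumed scheme)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(vec) * keep_ratio))
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def amplify(vec: List[float], scale: float) -> List[float]:
    """Multiply every entry by a constant factor (hypothetical parameter)."""
    return [scale * v for v in vec]


# Toy numbers, chosen only to make the arithmetic easy to follow.
base = [0.0, 1.0, -1.0, 0.5]
finetuned = [0.1, 1.0, -2.0, 0.75]

delta = parameter_delta(finetuned, base)
sparse = sparsify_top_k(delta, keep_ratio=0.5)
boosted = amplify(sparse, scale=2.0)
print(boosted)
```

In this toy setting, only the two largest-magnitude coordinates of the difference survive, and they are then doubled. The intuition is that a few strongly scaled coordinates are less likely to be washed out when the parameters are later combined with those of other models.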

Abstract

Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
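To make "directly fuses the parameters" concrete, here is a minimal sketch of one widely used merging scheme, task arithmetic, on flat parameter lists. Whether this exact scheme is among the merging algorithms evaluated in the paper is an assumption; the function name `merge_task_arithmetic` and the `scaling` parameter are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def merge_task_arithmetic(
    base: List[float],
    finetuned_models: Sequence[List[float]],
    scaling: float,
) -> List[float]:
    """Add the scaled difference (model - base) of every finetuned model to the base."""
    merged = list(base)
    for model in finetuned_models:
        if len(model) != len(base):
            raise ValueError("all models must share the base model's shape")
        for i, (w, b) in enumerate(zip(model, base)):
            merged[i] += scaling * (w - b)
    return merged


# Toy example: two "models" that each changed a different coordinate of the base.
base = [0.0, 0.0, 0.0]
model_a = [1.0, 0.0, 0.0]  # stands in for a model finetuned on task A
model_b = [0.0, 2.0, 0.0]  # stands in for a model finetuned on task B

merged = merge_task_arithmetic(base, [model_a, model_b], scaling=0.5)
print(merged)
```

Because the merge is a plain sum of parameter differences, every uploaded model contributes to the result in proportion to the size of its difference from the base. This is why a model obtained from an untrusted source can influence the merged model's behavior.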

Paper Structure

This paper contains 33 sections, 5 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Illustration of backdoor attacks to the model merging of LLMs.
  • Figure 2: Overview of Merge Hijacking.
  • Figure 3: Attack performance (%) with different $N$.
  • Figure 4: Attack performance on three tasks with different merging ratios of the malicious upload model.
  • Figure 5: Examples of different triggers adopted in our experiments.
  • ...and 3 more figures