
When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

Jiaqing Li, Zhibo Zhang, Shide Zhou, Yuxi Li, Tianlong Yu, Kailong Wang

Abstract

Model merging has emerged as a powerful technique for combining specialized capabilities from multiple fine-tuned LLMs without additional training costs. However, the security implications of this widely adopted practice remain critically underexplored. In this work, we reveal that model merging introduces a novel attack surface that can be systematically exploited to compromise safety alignment. We present TrojanMerge, a framework that embeds latent malicious components into source models that remain individually benign but produce severely misaligned models when merged. Our key insight is formulating this attack as a constrained optimization problem: we construct perturbations that preserve source model safety through directional consistency constraints, maintain capabilities via Frobenius directional alignment constraints, yet combine during merging to form pre-computed attack vectors. Extensive experiments across 9 LLMs from 3 model families demonstrate that TrojanMerge consistently achieves high harmful response rates in merged models while source models maintain safety scores comparable to unmodified versions. Our attack succeeds across diverse merging algorithms and remains effective under various hyperparameter configurations. These findings expose fundamental vulnerabilities in current model merging practices and highlight the urgent need for security-aware merging mechanisms.

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The diagram illustrates the core mechanism of TrojanMerge, depicting two normal expert models alongside two maliciously modified models. While each manipulated model preserves safety-aligned responses to malicious inputs in its individual state (evidenced by its rejection of unsafe queries such as planning exam fraud), the merging process amplifies latent vulnerabilities introduced during parameter optimization. This leads to catastrophic safety degradation in the fused model, which generates high-risk outputs, as demonstrated by the toxic response to the same query in the post-merging scenario.
  • Figure 2: Overview of TrojanMerge. (a) Workflow: Two source models are embedded with latent attack components $\Delta U_1$ and $\Delta U_2$ while preserving individual safety. Upon merging, these components reconstruct a safety-critical transformation $\Delta W$, causing the merged model to become misaligned. (b) Optimization: The components $\Delta U_i$ are synthesized by minimizing safety-preserving ($\mathcal{L}_{1,i}$) and capability-preserving ($\mathcal{L}_{2,i}$) losses, subject to the hard constraint $\sum_i \Delta U_i = n \cdot \Delta W$, which guarantees emergent misalignment post-merging.
  • Figure 3: Impact of Different Hyperparameters on TrojanMerge's Performance Across Different LLMs (AdvBench, HS%). Higher values indicate more effective attacks. (a) Scaling Factor $\lambda$ controls the magnitude of parameter adjustments. (b) Weighting Factor $x$ balances the influence of different model components. (c) DARE Pruning Rate $p$ determines the proportion of parameters pruned during merging. (d) TIES-Merging Top-K Parameter $K$ controls the number of top parameters retained during merging.
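The constraint in Figure 2, $\sum_i \Delta U_i = n \cdot \Delta W$, can be illustrated with a toy numerical sketch. The idea is that under a weight-averaging merge, $W_{\text{merged}} = \frac{1}{n}\sum_i (W_i + \Delta U_i)$, so any components that cancel across the sources vanish while the pre-computed shift $\Delta W$ survives intact. The matrices below are random stand-ins (the real attack applies this to actual model weights under additional safety- and capability-preserving constraints that this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # toy weight-matrix dimension
n = 2   # number of source models being merged

# Hypothetical target: the "safety-critical" shift the attacker wants the
# merged model to acquire (random matrix, purely illustrative).
delta_W = rng.normal(size=(d, d))

# Cancelling component C: large inside each source, but it vanishes on merge.
C = 5.0 * rng.normal(size=(d, d))

# Split the attack across the n sources so that sum(delta_U_i) = n * delta_W.
delta_U = [n * delta_W / 2 + C, n * delta_W / 2 - C]
assert np.allclose(sum(delta_U), n * delta_W)  # the hard constraint holds

# Weight-averaging merge of clean weights W_i plus the planted components.
W = [rng.normal(size=(d, d)) for _ in range(n)]
W_merged = sum(W_i + dU for W_i, dU in zip(W, delta_U)) / n
W_clean = sum(W) / n

# The merged model carries exactly the pre-computed shift delta_W, even
# though each individual delta_U_i is dominated by the cancelling term C.
assert np.allclose(W_merged - W_clean, delta_W)
```

Each planted component is dominated by the large cancelling term, which is what lets the individual sources look benign; only their average reveals the attack vector.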