Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev

TL;DR

This work reveals a deployment-time security gap in LLM pruning: an adversary can craft a model that is benign before pruning but malicious after it. The authors implement a pruning-activated attack that first estimates how likely each parameter is to be pruned, injects malicious behavior into parameters unlikely to be pruned, and then repairs the model using pruning-prone parameters, which conceals the attack until pruning occurs. Experiments across five models and three pruning algorithms show high post-pruning attack success rates while preserving the utility of the unpruned model, highlighting practical risks for real-world deployments. The authors discuss defense directions, including security-aware calibration and patching, and urge the development of secure model compression standards to mitigate such threats.

Abstract

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the attacked model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
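To make the "inject, then repair" idea concrete, the following is a minimal toy sketch, not the authors' method. It assumes a single linear unit and plain magnitude pruning, and the names `magnitude_scores` and `prune_by_score` are hypothetical helpers introduced only for illustration. Calibration-based scores such as those used by Wanda and SparseGPT are not modeled, and the paper itself operates on full LLMs rather than hand-picked weights.

```python
import numpy as np


def magnitude_scores(w):
    """Magnitude-pruning score: weights with small |w| are pruned first."""
    return np.abs(w)


def prune_by_score(w, scores, sparsity):
    """Zero out the `sparsity` fraction of weights with the lowest scores."""
    n_prune = int(round(sparsity * w.size))
    prune_idx = np.argsort(scores, kind="stable")[:n_prune]
    pruned = w.copy()
    pruned[prune_idx] = 0.0
    return pruned


# Toy one-unit "model": output = w @ x. A nonzero output stands in for the
# unwanted behavior; a zero output stands in for benign behavior.
x = np.array([1.0, 1.0, 4.0, 4.0])

# (i) Estimate which weights survive pruning: here, the two largest magnitudes.
w = np.array([2.0, 2.0, -0.5, -0.5])
scores = magnitude_scores(w)

# (ii) The "injected" signal sits in the high-score weights (indices 0 and 1).
# (iii) The "repair" sits in the low-score weights (indices 2 and 3) and
#       cancels the injected signal as long as the model stays unpruned.
dense_output = float(w @ x)

w_pruned = prune_by_score(w, scores, sparsity=0.5)
pruned_output = float(w_pruned @ x)

print(dense_output)   # 0.0
print(pruned_output)  # 4.0
```

In this toy setting the dense model's output is zero because the low-magnitude weights offset the high-magnitude ones; pruning removes exactly those offsetting weights, so the signal reappears. The same cancellation structure is what the abstract describes at the scale of a full model.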

Paper Structure

This paper contains 50 sections, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of our attack. ① The adversary (i) first estimates which parameters are likely to be pruned, then (ii) injects malicious behavior into the parameters that are unlikely to be pruned, and (iii) repairs the model by using the parameters that are likely to be pruned. ② The model is shared through a model sharing platform, and is seemingly benign before pruning, performing comparably to other models on standard benchmarks and safety evaluations. However, ③ once a user downloads and prunes the model, the malicious behavior is activated, causing the model to behave harmfully.
  • Figure 2: The percentage of repaired parameters and ASR. For each scenario, we plot the ASR (averaged over the models in the main experimental results table) of the attacked model before and after pruning when varying the percentage of repaired parameters.
  • Figure 3: The pruning score correlation between the pre-injection model and the attacked model (Qwen2.5-7B, Content Injection). We randomly sample 10,000 weights from each layer and plot the quantile of their pruning score before and after the attack. Among parameters selected for repair, green / red points denote those pruned / retained, respectively.
  • Figure 4: Over-refusal training dataset generation.
  • Figure 5: Content injection training dataset generation.
  • ...and 3 more figures