Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits

Dev Patel, Gabrielle Gervacio, Diekola Raimi, Kevin Zhu, Ryan Lagasse, Gabriel Grand, Ashwinee Panda, Maheep Chaudhary

TL;DR

The paper addresses the problem that dynamic pruning can degrade the safety alignment of LLMs. It proposes Alignment-Aware Probe Pruning (AAPP), which combines probe-based pruning with a risk-aware gating mechanism that preserves alignment-critical circuits identified from historical safe and harmful prompts. Across multiple models and datasets, AAPP achieves up to ~50% higher refusal rates at matched compute and keeps toxicity and accuracy closer to those of the unpruned baselines, yielding safer and more efficient inference. The work offers a practical route to deploying efficient LLMs without substantially compromising safety, by explicitly constraining pruning to protect alignment-sensitive components.

Abstract

Large Language Models require substantial computational resources for inference, posing deployment challenges. While dynamic pruning offers superior efficiency over static methods through adaptive circuit selection, it exacerbates alignment degradation: because it retains only input-dependent circuits, it does not guarantee that safety-critical circuits are preserved across diverse inputs. As a result, addressing these heightened alignment vulnerabilities remains critical. We introduce Alignment-Aware Probe Pruning (AAPP), a dynamic structured pruning method that adaptively preserves alignment-relevant circuits during inference, building upon Probe Pruning. Experiments on LLaMA 2-7B, Qwen2.5-14B-Instruct, and Gemma-3-12B-IT show AAPP improves refusal rates by 50% at matched compute, enabling efficient yet safety-preserving LLM deployment.

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Refusal rates of LLaMA-2-7B, Qwen-2.5-14B, and Gemma-3-12B models on the WildJailbreak dataset [jiang2024wildteaming] under pruning ratio $r=0.3$. We compare our Alignment-Aware Probe Pruning (AAPP) against two baselines: Probe Pruning (PP) [le2025probepruning] and random pruning. Across all three models, AAPP consistently achieves higher refusal rates, demonstrating that preserving alignment-critical circuits upon the detection of adversarial prompts improves safety behavior under pruning.
  • Figure 2: Alignment-Aware Probe Pruning (AAPP) is executed in five stages: (1) From the layer-normalized hidden states, pick tokens based on residual importance and build a small probe. (2) Run the probe a few layers ahead to produce probing states. (3a) A KL gate compares the probing states to historical states from safe and harmful prompts and fires when they are closer to the harmful ones, ensuring the preservation of alignment-critical structures. (3b) If the gate does not fire, the probing states are simply fused with the general historical states. (4) Use the integrated states to calculate the pruning metric [le2025probepruning] and prune low-score channels. (5) Perform full inference on the remaining weights.
  • Figure 3: Refusal rate vs compute (GFLOPs/token) across models. AAPP consistently achieves higher refusal rates at lower compute costs than standard PP, demonstrating improved alignment–efficiency trade-offs.
  • Figure 4: Toxicity vs prune ratio across models. AAPP consistently maintains lower toxicity, and thus safer outputs, under pruning than PP on both Llama-2-7B-chat and Qwen2.5-14B-Instruct.
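
The gating and pruning stages described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`to_distribution`, `gate_fires`, `select_channels`), the normalization of states into distributions, the rule "fire when the KL divergence to the harmful history is smaller than to the safe history", and the idea of a fixed `protected` channel set are all assumptions made for illustration. The paper's actual gate, fusion step, and pruning metric may differ.

```python
import math

def to_distribution(state, eps=1e-8):
    """Turn per-channel activations into a probability distribution (assumed normalization)."""
    mags = [abs(x) + eps for x in state]
    total = sum(mags)
    return [m / total for m in mags]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def gate_fires(probe_state, safe_state, harmful_state):
    """Hypothetical KL gate: fire when the probe is closer to the harmful history."""
    p = to_distribution(probe_state)
    kl_safe = kl_divergence(p, to_distribution(safe_state))
    kl_harm = kl_divergence(p, to_distribution(harmful_state))
    return kl_harm < kl_safe

def select_channels(scores, prune_ratio, protected, fired):
    """Return the indices of kept channels.

    The lowest-scoring channels are pruned; if the gate fired, channels in
    `protected` (assumed alignment-critical) are excluded from pruning.
    """
    n = len(scores)
    n_prune = int(n * prune_ratio)
    candidates = [i for i in range(n) if not (fired and i in protected)]
    candidates.sort(key=lambda i: scores[i])  # lowest score first
    pruned = set(candidates[:n_prune])
    return [i for i in range(n) if i not in pruned]

# Toy example with made-up numbers (not data from the paper).
probe = [0.1, 0.1, 2.0, 0.1]
safe_history = [1.0, 1.0, 0.1, 1.0]
harmful_history = [0.1, 0.2, 1.8, 0.1]
scores = [0.9, 0.1, 0.2, 0.8]  # stand-in for the pruning metric

fired = gate_fires(probe, safe_history, harmful_history)
kept = select_channels(scores, prune_ratio=0.5, protected={1}, fired=fired)
```

In this toy setup the same number of channels is pruned whether or not the gate fires, which loosely mirrors the "matched compute" comparison; only the choice of which channels to drop changes.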