TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Florentin Beck, William Rudman, Carsten Eickhoff

TL;DR

TRIM addresses the inefficiency of uniform sparsity in pruning large language models by introducing dimension-wise sparsity, assigning a tailored sparsity ratio to each output dimension within a layer. It uses an iterative, metric-driven procedure that updates a dimension-wise sparsity vector $S$ to minimize variance in post-pruning quality across outputs, leveraging a layer-wide constraint $(1/D)\sum_i S_i=T$ and a quality measure $q_k$ alongside per-dimension scores $c_i$. The approach demonstrates strong gains across Qwen2.5, LLaMA-2, and OPT models at high sparsity (notably 80%), including substantial perplexity reductions and improved zero-shot performance, while remaining compatible with existing pruning strategies like OWL and AlphaPruning and maintaining low overhead. Overall, dimension-wise sparsity adaptation emerges as a key factor for pushing the limits of extreme LLM compression with practical, plug-in applicability.
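
The loop described above can be pictured with a minimal sketch, which is not the authors' implementation. It keeps one sparsity ratio per output row, prunes each row to its own ratio, measures how well each row's output survives, and shifts the ratios while re-projecting onto the constraint $(1/D)\sum_i S_i=T$. The cosine-similarity quality measure, the additive update rule, the learning rate, and all function names are assumptions made for illustration only.

```python
import math


def project_to_mean(ratios, target, iters=100):
    """Shift and clip ratios to [0, 1] until their mean is close to `target`."""
    ratios = list(ratios)
    for _ in range(iters):
        shift = target - sum(ratios) / len(ratios)
        if abs(shift) < 1e-12:
            break
        ratios = [min(1.0, max(0.0, r + shift)) for r in ratios]
    return ratios


def prune_row(w_row, score_row, ratio):
    """Zero the floor(ratio * n) weights with the lowest pruning scores."""
    k = int(ratio * len(w_row))
    order = sorted(range(len(w_row)), key=lambda j: score_row[j])
    drop = set(order[:k])
    return [0.0 if j in drop else w for j, w in enumerate(w_row)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def row_quality(w_row, pruned_row, calib):
    """Assumed quality measure: cosine similarity of dense vs. pruned row outputs."""
    dense = [sum(x * w for x, w in zip(sample, w_row)) for sample in calib]
    sparse = [sum(x * w for x, w in zip(sample, pruned_row)) for sample in calib]
    return cosine(dense, sparse)


def adapt_ratios(weights, scores, calib, target, lr=0.1, steps=5):
    """Hypothetical update: rows that retain more quality absorb more sparsity."""
    d = len(weights)
    ratios = [target] * d  # start from the uniform allocation
    for _ in range(steps):
        q = [
            row_quality(w, prune_row(w, m, s), calib)
            for w, m, s in zip(weights, scores, ratios)
        ]
        q_mean = sum(q) / d
        proposal = [s + lr * (qi - q_mean) for s, qi in zip(ratios, q)]
        ratios = project_to_mean(proposal, target)
    return ratios


# Toy layer: row 0 has one dominant weight, row 1 is spread evenly.
weights = [[4.0, 0.1, 0.1, 0.1], [1.0, 1.0, 1.0, 1.0]]
scores = [[abs(w) for w in row] for row in weights]  # magnitude as a stand-in metric
calib = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
ratios = adapt_ratios(weights, scores, calib, target=0.5)
```

In this toy case the row with a single dominant weight tolerates pruning better, so it ends up with a higher ratio than the evenly spread row, while the layer average stays at the target.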

Abstract

Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM

Paper Structure

This paper contains 31 sections, 1 equation, 4 figures, 17 tables.

Figures (4)

  • Figure 1: Illustrating non-uniform, dimension-wise sparsity. On the left is Wanda, which applies the layer sparsity ratio $\mathbf{T}$ uniformly to all output dimensions (rows) of the weight matrix $\mathbf{W}$. TRIM iteratively defines sparsity ratios for individual dimensions in a non-uniform way. This targeted distribution of the available sparsity budget improves local (and global) pruning quality.
  • Figure 2: Perplexity progression from 70% to 80% sparsity. TRIM extends the usable sparsity range.
  • Figure 3: Gallery of pruning diagnostics. (a) Gini histogram: Gini coefficients of the Wanda pruning metric across output dimensions. Higher Gini $\Rightarrow$ signal concentrated in fewer weights. (b) Cosine similarity vs Sparsity: Dimension-wise cosine similarities at increasing sparsity, showing heterogeneous pruning sensitivity across dimensions. (c) LR sign vs outlier Gini: LLaMA-2-13B (higher Gini; more concentrated outliers) uses more negative effective learning rates than Qwen2.5-14B (lower Gini; more uniform outliers). Both plots show the gate-projection.
  • Figure 4: Histogram showing the Gini coefficients for all layers of Qwen2.5-14B (K-proj). A high Gini means that the pruning metric (here: Wanda) is concentrated in fewer weights.
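
The Gini diagnostics in Figures 3 and 4 can be illustrated with a short sketch. It assumes the usual form of the Wanda score, $|W_{ij}| \cdot \lVert X_j \rVert_2$, which is not spelled out in the text above, and it uses the standard sorted-rank formula for the Gini coefficient. The function names and toy values are hypothetical.

```python
def wanda_scores(w_row, input_norms):
    """Assumed Wanda metric for one row: |weight| times the input feature's L2 norm."""
    return [abs(w) * n for w, n in zip(w_row, input_norms)]


def gini(values):
    """Gini coefficient of non-negative values (0 = evenly spread)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values, ranks starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy rows: one with a dominant weight, one with evenly sized weights.
norms = [1.0, 1.0, 1.0, 1.0]
concentrated = wanda_scores([4.0, 0.1, -0.1, 0.1], norms)
uniform = wanda_scores([1.0, -1.0, 1.0, 1.0], norms)
```

A row whose score mass sits in a few weights yields a Gini near 1, and a row with evenly sized scores yields a Gini near 0, which matches the reading of the histograms given in the captions.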