GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression

Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang, Jing Xiao

TL;DR

This paper proposes GRASP (Gradient-based Retention of Adaptive Singular Parameters), a compression framework that mitigates the performance degradation caused by indiscriminate layer pruning by preserving sensitivity-aware singular values, achieving efficient compression while maintaining strong performance with minimal overhead.

Abstract

Recent studies have demonstrated that many layers in large language models (LLMs) are functionally redundant, enabling model compression by removing these layers to reduce inference cost. While such approaches can improve efficiency, indiscriminate layer pruning often results in significant performance degradation. In this paper, we propose GRASP (Gradient-based Retention of Adaptive Singular Parameters), a novel compression framework that mitigates this issue by preserving sensitivity-aware singular values. Unlike direct layer pruning, GRASP leverages gradient-based attribution on a small calibration dataset to adaptively identify and retain critical singular components. By replacing redundant layers with only a minimal set of parameters, GRASP achieves efficient compression while maintaining strong performance with minimal overhead. Experiments across multiple LLMs show that GRASP consistently outperforms existing compression methods, achieving 90% of the original model's performance at a 20% compression ratio.
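
The idea of scoring singular components by gradient-based attribution, rather than by magnitude alone, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the scoring rule used here, a first-order estimate |σ_i · ∂L/∂σ_i|, is an assumption chosen because it is a common attribution heuristic, and the function names, the toy least-squares loss, and the matrix shapes are all hypothetical.

```python
import numpy as np


def singular_value_importance(W, G):
    """Score each singular value of W by |sigma_i * dL/dsigma_i|.

    W: weight matrix, shape (m, n).
    G: gradient of a calibration loss with respect to W, same shape.
    The first-order score is an assumed heuristic, not the paper's exact rule.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # With W = sum_i sigma_i * u_i v_i^T, the chain rule gives
    # dL/dsigma_i = u_i^T G v_i.
    grad_s = np.einsum("ji,jk,ik->i", U, G, Vt)
    return U, s, Vt, np.abs(s * grad_s)


def retain_top_k(W, G, k):
    """Rebuild W from the k singular components with the highest score."""
    U, s, Vt, score = singular_value_importance(W, G)
    keep = np.sort(np.argsort(score)[::-1][:k])
    W_k = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return W_k, keep


# Toy calibration setup (hypothetical): a linear layer with a squared-error loss.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))    # layer weights, shape (out, in)
X = rng.normal(size=(32, 6))   # calibration inputs
Y = rng.normal(size=(32, 8))   # calibration targets
G = (X @ W.T - Y).T @ X / len(X)  # gradient of 0.5 * mean squared error w.r.t. W

W_k, keep = retain_top_k(W, G, k=3)
print("retained singular indices:", keep)
print("rank of compressed matrix:", np.linalg.matrix_rank(W_k))
```

Because `np.linalg.svd` returns singular values in descending order, any retained index set other than the first k positions means the score selected components that magnitude-based truncation would have dropped.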

Paper Structure

This paper contains 44 sections, 16 equations, 6 figures, 15 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Unlike conventional layer pruning, which either skips redundant layers—often causing moderate performance drops—or replaces them with lightweight modules that require additional training, GRASP (right) retains only the most critical 10% of parameters within the redundant layers, effectively preserving accuracy with minimal overhead.
  • Figure 2: Sensitivity analysis of grouped singular value truncation. While singular values are typically ordered by magnitude, their impact on downstream performance does not follow the same order.
  • Figure 3: Comparison between our method and low-rank pruning baselines on four different LLMs. Average accuracy is reported across seven commonsense reasoning benchmarks: OpenBookQA, WinoGrande, HellaSwag, ARC-easy, ARC-challenge, PIQA, and MathQA.
  • Figure 4: Performance of GRASP on LLaMA3.1-8B-Instruct under 20% compression using (a) different calibration datasets (WikiText-2, C4) and (b) varying amounts of calibration data from WikiText-2. GRASP demonstrates limited sensitivity to calibration data changes, with final task performance varying within 4%.
  • Figure 5: Throughput of LLaMA2-7B and the GRASP-compressed model at a 25% compression ratio on a single A100 GPU. Top: Throughput across different sequence lengths (batch size = 32). Bottom: Throughput across different batch sizes (sequence length = 32).
  • ...and 1 more figure
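
To give a sense of scale for the "10% of parameters" retained in Figure 1, the following sketch converts a parameter budget into a number of retained singular components. It assumes a retained rank-k factorization of an m × n matrix is stored as two factors with k(m + n) parameters in total; this accounting, the function name, and the example shapes are illustrative assumptions, not details taken from the paper.

```python
def rank_for_budget(m, n, ratio):
    """Largest rank k such that k * (m + n) <= ratio * m * n.

    Assumes the retained components are stored as an (m, k) factor and a
    (k, n) factor, with singular values folded into one of the factors.
    """
    return int(ratio * m * n / (m + n))


# Illustrative shapes only: a square projection and a wider feed-forward matrix.
for m, n in [(4096, 4096), (4096, 11008)]:
    k = rank_for_budget(m, n, ratio=0.10)
    print(f"{m}x{n}: keep k={k} components, {k * (m + n)} of {m * n} parameters")
```

Under this accounting, a 10% budget corresponds to a small fraction of a matrix's full rank, which is why the choice of which components to keep matters.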