GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks

Wenwu Tang, Dong Wang, Lothar Thiele, Olga Saukh

TL;DR

GRAIL is a post-hoc blockwise compensation method: a simple zero-finetuning step applied after model compression that restores each block's input-output behavior using a small calibration set. It consistently improves accuracy or perplexity over data-free and data-aware pruning or folding baselines in practical compression regimes, with manageable overhead and no backpropagation.

Abstract

Structured deep model compression methods are hardware-friendly and substantially reduce memory and inference costs. However, under aggressive compression, the resulting accuracy degradation often necessitates post-compression finetuning, which can be impractical due to missing labeled data or high training cost. We propose post-hoc blockwise compensation, called GRAIL, a simple zero-finetuning step applied after model compression that restores each block's input-output behavior using a small calibration set. The method summarizes hidden activations via a Gram matrix and applies ridge regression to linearly reconstruct the original hidden representation from the reduced one. The resulting reconstruction map is absorbed into the downstream projection weights, while the upstream layer is compressed. The approach is selector-agnostic (Magnitude, Wanda, Gram-based selection, or folding), data-aware (requiring only a few forward passes without gradients or labels), and recovers classic pruning or folding when the Gram matrix is near identity, indicating weak inter-channel correlations. Across ResNets, ViTs, and decoder-only LLMs, GRAIL consistently improves accuracy or perplexity over data-free and data-aware pruning or folding baselines in practical compression regimes, with manageable overhead and no backpropagation. The code is available at https://github.com/TWWinde/GRAIL_Compensation.
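
The ridge-regression step described in the abstract can be written out for the pruning case. This is a sketch of the standard ridge closed form, not the paper's own notation: the symbols $H$ (calibration activations at the consumer input, one row per sample), $S$ (indices of kept channels), $E_S$ (the column-selection matrix with $H_S = H E_S$), $\lambda$ (ridge strength), and $W$ (downstream projection, applied as $HW$) are all assumed here for illustration.

```latex
\begin{aligned}
G &= H^\top H, \qquad G_{SS} = E_S^\top G E_S, \qquad G_{S,:} = E_S^\top G, \\
B^\star &= \arg\min_{B} \; \lVert H - H_S B \rVert_F^2 + \lambda \lVert B \rVert_F^2
         = \left( G_{SS} + \lambda I \right)^{-1} G_{S,:}, \\
W' &= B^\star W, \qquad \text{so that } H_S W' \approx H W. \\[4pt]
\text{If } G \approx c\,I: \quad
B^\star &\approx \tfrac{c}{c+\lambda}\, E_S^\top
\;\;\Rightarrow\;\;
W' \approx \tfrac{c}{c+\lambda}\, W_{S,:}.
\end{aligned}
```

Under these assumptions, only the Gram matrix $G$ is needed to form $B^\star$, and the last line shows why a near-identity Gram matrix with small $\lambda$ reduces to keeping the selected rows of $W$, which is classic pruning.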

Paper Structure (22 sections, 19 equations, 19 figures, 3 tables)

Figures (19)

  • Figure 1: GRAIL: GRAm-Integrated Linear compression workflow. After a structured compression decision (pruning / folding) narrows the producer’s hidden width, we run a small, unlabeled calibration set through the upstream blocks and collect activations at consumer input. From these activations, GRAIL forms the Gram matrix of second-order statistics and solves a ridge regression that reconstructs the original hidden representation from the reduced one. The resulting linear compensation is merged into the consumer projection weights, while the producer is narrowed by selection or clustering. This one-shot, training-free step restores each block’s input–output behavior and applies uniformly to pruning and folding across CNNs, ViTs, and LLMs.
  • Figure 4: Ablation on compensation dataset size. Left: effect on accuracy recovery for ResNet-18 on CIFAR-10 at 75% sparsity. Right: effect on LLaMA-2-7B perplexity at 40% sparsity on WikiText-2. In our experiments we use 128 unlabeled images for vision models (ResNet-18, ViT, CLIP) and only 128 sequences of length 2048 tokens for LLMs.
  • Figure 6: GRAIL on ResNet-18 and ViT-B/32 under random folding and pruning. Across both architectures, GRAIL consistently improves the accuracy of compressed models, as seen in the before/after scatter plots (left) and the accuracy gains across compression ratios (right) for all four settings: ResNet-18 folding (a), ResNet-18 pruning (b), ViT-B/32 folding (c), and ViT-B/32 pruning (d).
  • Figure : (a) Test accuracy vs. layer-wise uniform compression ratio
  • Figure : (a) Test accuracy vs. layer-wise uniform compression ratio
  • ...and 14 more figures
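
The workflow in the Figure 1 caption (collect activations, form the Gram matrix, solve a ridge regression, merge the result into the consumer weights) can be sketched in a few lines of NumPy for the pruning case. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `compensate_consumer`, its arguments, and the choice to scale the ridge term by the mean Gram diagonal are all hypothetical, and the synthetic data stands in for real calibration activations.

```python
import numpy as np


def compensate_consumer(H, W_consumer, keep, ridge=1e-3):
    """Hypothetical helper: fold a ridge reconstruction map into consumer weights.

    H          : (N, d) calibration activations at the consumer input.
    W_consumer : (d, m) consumer weights, applied as H @ W_consumer.
    keep       : 1-D integer array of the k channels retained by the selector.
    Returns a (k, m) weight matrix to apply to the reduced activations H[:, keep].
    """
    G = H.T @ H                          # (d, d) Gram matrix of second-order statistics
    G_kk = G[np.ix_(keep, keep)]         # (k, k) block over kept channels
    G_kd = G[keep, :]                    # (k, d) kept channels against all channels
    k = len(keep)
    lam = ridge * np.trace(G_kk) / k     # assumption: ridge relative to mean diagonal
    # Solve (G_kk + lam * I) B = G_kd, so that H[:, keep] @ B approximates H.
    B = np.linalg.solve(G_kk + lam * np.eye(k), G_kd)
    return B @ W_consumer                # merge the map into the consumer projection


# Synthetic demo: strongly correlated channels generated from 4 latent factors.
rng = np.random.default_rng(0)
Z = rng.standard_normal((256, 4))
H = Z @ rng.standard_normal((4, 16)) + 0.01 * rng.standard_normal((256, 16))
W = rng.standard_normal((16, 5))
keep = np.arange(8)                      # keep the first 8 of 16 channels

Y = H @ W                                             # original block output
err_prune = np.linalg.norm(H[:, keep] @ W[keep] - Y)  # plain pruning, no compensation
err_comp = np.linalg.norm(H[:, keep] @ compensate_consumer(H, W, keep) - Y)
print(f"pruning error: {err_prune:.4f}, compensated error: {err_comp:.4f}")
```

The sketch needs only forward-pass activations and a single linear solve, which matches the caption's description of a one-shot, training-free step. Folding would replace the channel selection with a cluster-merge matrix, which is not shown here.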