
$\rm SP^3$: Enhancing Structured Pruning via PCA Projection

Yuxuan Hu, Jing Zhang, Zhe Zhao, Chen Zhao, Xiaodong Chen, Cuiping Li, Hong Chen

TL;DR

The paper addresses the underexplored potential of pruning the transformer hidden dimension $d$ in pre-trained language models. It introduces SP$^3$, which projects features into a PCA-defined subspace before masking and adds residual linear transformations to allow layer-specific hidden-dimension pruning. This combination yields strong compression with minimal accuracy loss. Empirical results on GLUE and SQuAD show SP$^3$ achieving about $70\%$ hidden-dimension reduction and $94\%$ overall compression of BERT$_{\rm base}$ while maintaining $\geq 96\%$ of performance, outperforming prior methods by up to about 6 percentage points in accuracy at the same compression ratio. The approach also extends to OPT and Llama, and the authors discuss practical considerations, limitations, and potential improvements such as Group PCA Projection for large language models.

Abstract

Structured pruning is a widely used technique for reducing the size of pre-trained language models (PLMs), but current methods often overlook the potential of compressing the hidden dimension ($d$) in PLMs, a dimension critical to model size and efficiency. This paper introduces a novel structured pruning approach, Structured Pruning with PCA Projection (SP$^3$), targeting the effective reduction of $d$ by projecting features into a space defined by principal components before masking. Extensive experiments on benchmarks (GLUE and SQuAD) show that SP$^3$ can reduce $d$ by 70%, compress 94% of the BERT$_{\rm base}$ model, maintain over 96% accuracy, and, at the same compression ratio, achieve 6% higher accuracy than other methods that compress $d$. SP$^3$ has also proven effective with other models, including OPT and Llama. Our data and code are available at an anonymous repo.
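
To make the core idea of "projecting before masking" concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the fixed top-`keep` mask (SP$^3$ works with masks during pruning rather than a hand-set cutoff), and the toy data are all assumptions made for illustration. The sketch compares masking raw feature dimensions against masking after rotating the features into their principal-component basis, with the rotation folded into the next weight matrix.

```python
import numpy as np


def project_then_mask(features, weight, keep):
    """Approximate features @ weight using only `keep` principal directions.

    features: (N, d) token features; weight: (d, d_out).
    Hypothetical helper for illustration, not the paper's code.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T              # features in the PCA basis
    mask = np.zeros(vt.shape[0])
    mask[:keep] = 1.0                        # fixed mask: keep leading components
    folded_weight = vt @ weight              # fold the projection into the weight
    return (projected * mask) @ folded_weight + mean @ weight


def mask_raw_dims(features, weight, keep):
    """Baseline: zero out all but the first `keep` raw feature dimensions."""
    mask = np.zeros(features.shape[1])
    mask[:keep] = 1.0
    return (features * mask) @ weight


# Toy data (assumed): 32-dimensional features that vary in only 8 directions.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 32)) + 0.5
weight = rng.normal(size=(32, 16))
full = features @ weight

pca_err = np.abs(project_then_mask(features, weight, keep=8) - full).max()
raw_err = np.abs(mask_raw_dims(features, weight, keep=8) - full).max()
print(f"PCA-then-mask error: {pca_err:.2e}, raw-mask error: {raw_err:.2e}")
```

On this toy data the variance is concentrated in a few directions that are spread across all raw dimensions, so dropping raw dimensions discards information while dropping trailing principal components does not. For scale, BERT$_{\rm base}$ has a hidden size of 768, so a 70% reduction of $d$ would leave roughly 230 dimensions.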

Paper Structure (33 sections, 33 equations, 7 figures, 12 tables)


Figures (7)

  • Figure 1: (a) Depiction of principal component weights within $\Sigma$. The length of each rectangle corresponds to the percentage of that principal component's weight among all principal components. (b) Illustration of the principal component matrix. Each row of the matrix represents a principal component. Brighter colors indicate a higher correlation between the feature dimension and the principal component under investigation. (c) Projection of features onto principal components for easier dimension pruning. (d) Positioning the projection matrix outside the residual. In the figure, $d$, $d_1$, and $d_2$ denote the dimensions of the features and $N$ denotes the number of tokens.
  • Figure 2: Illustration of the workflow of SP$^3$.
  • Figure 3: Illustration of the SP$^3$ architecture, in which the gray rectangles represent the weight matrices, the yellow rectangles signify the projection matrices, and the red rectangles indicate the masks.
  • Figure 4: Structural information of the pruned model on the MRPC dataset, where sparsity denotes the ratio of the remaining dimension or size to the original dimension or size. (a) Output dimensions of each MHA and FFN block. (b) Intermediate dimensions of each MHA and FFN block. (c) The number of attention heads in each MHA block.
  • Figure 5: Illustration of the SP$^3$ architecture for LLM, in which the gray rectangles represent the weight matrices, the yellow rectangles signify the projection matrices, and the red rectangles indicate the masks.
  • ...and 2 more figures
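
The "principal component weights" described for Figure 1(a) can be sketched as follows. This is a hedged illustration, not the paper's code. It assumes that a component's weight is its share of total variance (squared singular values); the paper may instead normalize the entries of $\Sigma$ directly. The helper names `component_weights` and `dims_for_coverage` are hypothetical.

```python
import numpy as np


def component_weights(features):
    """Share of total variance carried by each principal component.

    features: (N, d) array. Assumes 'weight' means variance share.
    """
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def dims_for_coverage(weights, coverage):
    """Smallest number of leading components whose weights sum to >= coverage."""
    cumulative = np.cumsum(weights)
    count = int(np.searchsorted(cumulative, coverage)) + 1
    return min(count, len(weights))  # guard against floating-point shortfall


# Toy data (assumed): 10-dimensional features that vary in only 3 directions.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))
weights = component_weights(features)
print(np.round(weights, 3), dims_for_coverage(weights, 0.999))
```

When a few components carry most of the weight, as in the toy data here, the remaining components are natural candidates for masking after projection.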