Change-of-Basis Pruning via Rotational Invariance

Alex Ning, Vainateya Rangaraju

TL;DR

The paper addresses the dependence of structured pruning performance on the representation basis and proposes change-of-basis (CoB) pruning, enabled by two-subspace radial activations (TSRAs), which maintain rotational invariance within two activation subspaces $\mathbb{R}^d = U \oplus V$. TSRAs allow orthogonal transforms to be merged into surrounding weights without extra parameters, providing a principled route to concentrating importance along selected axes. The activation $\sigma(x) = f_U(|x_U|, |x_V|)x_U + f_V(|x_U|, |x_V|)x_V$, with $x = x_U + x_V$, preserves rotation within each subspace, and width saturation for TSRAs is upper-bounded by $\dim(S) \le 2d_{i-1}$ without bias and $\dim(S) \le 2(d_{i-1}+1)$ with bias. TSRA width saturation therefore scales favorably for deep networks, enabling practical high-width architectures. A PCA-based change of basis is used to concentrate activation magnitude along principal components, and experiments on VGG-16 with CIFAR-10 show that CoB improves pruning robustness under both fixed-ratio and threshold-based pruning, extending the reliable pruning frontier from roughly 30% to about 70% of parameters before fine-tuning and achieving 90-96% compression with modest accuracy loss after fine-tuning. The work demonstrates that rotationally invariant designs can enable principled CoB pruning and outlines avenues for broader activation families, initialization strategies, and alternative pruning schemes.
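The rotation property of the TSRA formula can be checked numerically. The sketch below is illustrative only: it assumes $U$ is spanned by the first `d_u` coordinates and $V$ by the rest, and the gate functions `gate_u` and `gate_v` are hypothetical stand-ins, not the gates used in the paper. It verifies that $\sigma(Rx) = R\,\sigma(x)$ when $R$ is block-diagonal with one orthogonal block per subspace, which is the property that lets $R$ be folded into neighboring weights.

```python
import numpy as np

def tsra(x, d_u, f_u, f_v):
    """sigma(x) = f_U(|x_U|, |x_V|) x_U + f_V(|x_U|, |x_V|) x_V for one vector x.

    Assumption: U is the span of the first d_u coordinates, V the remainder.
    """
    x_u, x_v = x[:d_u], x[d_u:]
    r_u, r_v = np.linalg.norm(x_u), np.linalg.norm(x_v)
    return np.concatenate([f_u(r_u, r_v) * x_u, f_v(r_u, r_v) * x_v])

def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def block_orthogonal(d_u, d_v, rng):
    """Orthogonal matrix that rotates U and V independently (no mixing between them)."""
    r = np.zeros((d_u + d_v, d_u + d_v))
    r[:d_u, :d_u] = random_orthogonal(d_u, rng)
    r[d_u:, d_u:] = random_orthogonal(d_v, rng)
    return r

# Hypothetical gates: any scalar functions of the two norms would do here.
gate_u = lambda r_u, r_v: np.tanh(r_u) / (1.0 + r_v)
gate_v = lambda r_u, r_v: 1.0 / (1.0 + np.exp(r_u - r_v))

rng = np.random.default_rng(0)
d_u, d_v = 3, 3
x = rng.standard_normal(d_u + d_v)
R = block_orthogonal(d_u, d_v, rng)

rotate_then_activate = tsra(R @ x, d_u, gate_u, gate_v)
activate_then_rotate = R @ tsra(x, d_u, gate_u, gate_v)
print(np.allclose(rotate_then_activate, activate_then_rotate))
```

The check succeeds because the gates see only $|x_U|$ and $|x_V|$, and both norms are unchanged by a rotation that acts within each subspace.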

Abstract

Structured pruning removes entire neurons or channels, but its effectiveness depends on how importance is distributed across the representation space. Change-of-basis (CoB) pruning addresses this challenge by applying orthogonal linear transformations that concentrate importance within certain dimensions. However, many standard deep learning architectures are not inherently invariant to such transformations. To enable compatibility, we introduce two-subspace radial activations (TSRAs): an activation family that is invariant to orthogonal linear transformations applied independently within its two activation subspaces. This invariance allows CoB transformations to be merged into surrounding weights without incurring extra parameters. We position this work as a proof of concept that a rotationally invariant design may offer a principled approach to change-of-basis pruning. We do not provide an analysis of multiple TSRA candidates, nor do we explore weight initialization for any TSRAs. These limitations, combined with other necessary modifications we make to permit rotational invariance, result in a slight accuracy drop of $4.52\%$ compared to a ReLU-based control. However, using activation-magnitude importance, VGG-16 implementing our CoB+TSRA framework shows encouraging results on CIFAR-10. Under fixed-ratio structured pruning, CoB improves accuracy over a TSRA baseline at all pruning ratios and extends the reliable pruning frontier from roughly $30\%$ to $70\%$ of parameters without post-prune fine-tuning. Under threshold-based pruning strategies, CoB prunes $90$ to $96\%$ of parameters while maintaining a $1$ to $6\%$ accuracy drop after fine-tuning. Together, these results indicate that rotationally invariant architectures may offer a promising path towards CoB pruning.

Paper Structure

This paper contains 18 sections, 2 theorems, 17 equations, 5 figures, 3 tables.

Key Result

Lemma 1

Given $\sigma$ as defined above, there exists a $1$-dimensional subspace $L \subset \mathbb{R}^d$ such that $\sigma(L) \not\subseteq L$. As a result, TSRAs do not preserve all linear subspaces.
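A small numerical illustration of the lemma follows. It is a sketch under stated assumptions, not the paper's proof: $U$ is taken as the first `d_u` coordinates, and the two gates inside `tsra` are hypothetical choices that happen to take different values at the test point. A vector with components in both subspaces is then scaled by different factors in $U$ and $V$, so its image leaves the line it spans, whereas a vector lying entirely in $U$ stays on its line.

```python
import numpy as np

def tsra(x, d_u):
    """TSRA with hypothetical gates; U = first d_u coordinates (assumption)."""
    x_u, x_v = x[:d_u], x[d_u:]
    r_u, r_v = np.linalg.norm(x_u), np.linalg.norm(x_v)
    f_u = np.tanh(r_u)       # hypothetical gate for the U component
    f_v = 1.0 / (1.0 + r_v)  # hypothetical gate for the V component
    return np.concatenate([f_u * x_u, f_v * x_v])

def stays_on_line(x, d_u):
    """True if sigma(x) is a scalar multiple of x, i.e. stays in span{x}."""
    y = tsra(x, d_u)
    return bool(np.linalg.matrix_rank(np.stack([x, y])) == 1)

mixed = np.array([1.0, 0.0, 2.0, 0.0])   # nonzero in both U and V
pure_u = np.array([1.0, 0.0, 0.0, 0.0])  # lies entirely in U

print(stays_on_line(mixed, d_u=2))   # the U and V parts are scaled differently
print(stays_on_line(pure_u, d_u=2))  # only one gate acts, so the direction is kept
```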

Figures (5)

  • Figure 1: Conceptual diagram of the proposed benefit of change-of-basis pruning, which respects hidden-state geometry. Left: model activations depicted in the original basis. Here the neurons (the original basis components) contribute roughly equally, so it may be unclear what to prune. Right: the same model activations depicted under a change of basis. Here the feature space is reordered such that dimensions 1 and 2 capture the most variance and dimension 3 captures very little. Pruning is clearly most feasible along the green axis.
  • Figure 2: Diagram showing the insertion and merging of orthogonal linear transformations into the hidden layer of a fully connected neural network in order to concentrate importance into certain component axes. This is followed by pruning, which is represented by structured pruning of neurons in the diagram. Optional bias not shown for simplicity. Assumes that the activation function $\phi$ is invariant under the orthogonal linear transformation $R$.
  • Figure 3: Pre-finetune accuracy of the baseline vs. CoB as a function of the percentage of parameters pruned.
  • Figure 4: Post-finetune accuracy of the baseline vs. CoB as a function of the percentage of parameters pruned.
  • Figure 5: Importance scores for certain layers of VGG-16 before and after the change of basis used to concentrate importance. The two-subspace structure of TSRAs is evident in the separation of importance scores between the first half and second half of the component dimensions. As the size of the hidden layers increases, there appears to be a trend of increasing kurtosis in the distribution of post-change-of-basis importance scores, indicating that the model is able to concentrate importance into a very small number of the many dimensions.
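The insert-merge-prune pipeline described in the Figure 2 caption can be sketched for a single bias-free hidden layer. This is a toy illustration under several assumptions that are not taken from the paper: the gates inside `tsra` are hypothetical, $U$ is the first `d_u` hidden coordinates, the per-subspace basis comes from an eigendecomposition of the uncentered second moment of the pre-activations, and importance is the mean squared pre-activation. The block-diagonal rotation $R$ is merged as $W_1 \to R W_1$ and $W_2 \to W_2 R^\top$, which leaves the layer's output unchanged.

```python
import numpy as np

def tsra(z, d_u):
    """Batched TSRA with hypothetical gates; U = first d_u columns (assumption)."""
    z_u, z_v = z[:, :d_u], z[:, d_u:]
    r_u = np.linalg.norm(z_u, axis=1, keepdims=True)
    r_v = np.linalg.norm(z_v, axis=1, keepdims=True)
    f_u = np.tanh(r_u) / (1.0 + r_v)
    f_v = 1.0 / (1.0 + np.exp(r_u - r_v))
    return np.concatenate([f_u * z_u, f_v * z_v], axis=1)

def pca_basis(z):
    """Rows are eigenvectors of the uncentered second moment, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(z.T @ z / len(z))
    return eigvecs[:, np.argsort(eigvals)[::-1]].T

def change_of_basis(w1, w2, x, d_u):
    """Merge a block-diagonal PCA rotation R into the weights around the activation."""
    z = x @ w1.T                      # pre-activations in the original basis
    d = w1.shape[0]
    r = np.zeros((d, d))
    r[:d_u, :d_u] = pca_basis(z[:, :d_u])
    r[d_u:, d_u:] = pca_basis(z[:, d_u:])
    return r @ w1, w2 @ r.T           # W1 -> R W1, W2 -> W2 R^T

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, d_u = 8, 6, 3, 3
w1 = rng.standard_normal((d_hidden, d_in))
w2 = rng.standard_normal((d_out, d_hidden))
x = rng.standard_normal((256, d_in))

w1_cob, w2_cob = change_of_basis(w1, w2, x, d_u)
y_before = tsra(x @ w1.T, d_u) @ w2.T
y_after = tsra(x @ w1_cob.T, d_u) @ w2_cob.T
print(np.allclose(y_before, y_after))  # the merged rotation does not change the output

# Importance per hidden dimension after the change of basis (assumed metric).
importance = np.mean((x @ w1_cob.T) ** 2, axis=0)

# Structured pruning: keep the two leading dimensions of each subspace.
keep_per_block = 2
keep = np.r_[0:keep_per_block, d_u:d_u + keep_per_block]
w1_pruned, w2_pruned = w1_cob[keep], w2_cob[:, keep]
```

After the change of basis, `importance` is sorted in decreasing order within each subspace, so pruning amounts to dropping trailing rows of `w1_cob` and the matching columns of `w2_cob`. The pruned layer would then use `keep_per_block` as its new `d_u`.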

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Lemma 2
  • proof