
Revealing the Utilized Rank of Subspaces of Learning in Neural Networks

Isha Garg, Christian Koguchi, Eshan Verma, Daniel Ulbricht

TL;DR

This work introduces a data-dependent notion of utilization for neural networks by projecting layer inputs and outputs onto their respective interaction subspaces and revealing an effective utilized rank $r = \min(k_S, k_T)$. It shows that most networks underutilize their weight-space and that a simple projection $W' = P_S W P_T$ preserves forward behavior while enabling low-rank decompositions that reduce parameters and FLOPs with minimal fine-tuning loss. The methodology yields a practical metric, Mean Layer Utilization (MLU), and demonstrates significant compression across ViT, ResNet, and other architectures on CIFAR and ImageNet, with self-supervised pretraining further increasing utilization. The results have direct implications for model efficiency and provide a framework to study the interaction between network structure and dataset complexity.

Abstract

In this work, we study how well the learned weights of a neural network utilize the space available to them. This notion is related to capacity, but additionally incorporates the interaction of the network architecture with the dataset. Most learned weights appear to be full rank, and are therefore not amenable to low-rank decomposition. This deceptively implies that the weights are utilizing the entire space available to them. We propose a simple data-driven transformation that projects the weights onto the subspace where the data and the weights interact. This preserves the functional mapping of the layer and reveals its low-rank structure. We find that most models utilize only a fraction of the available space. For instance, for ViTB-16 and ViTL-16 trained on ImageNet, the mean layer utilization is 35% and 20% respectively. Our transformation reduces the parameters to 50% and 25% respectively, with less than a 0.2% accuracy drop after fine-tuning. We also show that self-supervised pre-training drives this utilization up to 70%, justifying its suitability for downstream tasks.
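
The projection idea can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a toy example under stated assumptions: a single linear layer computing y = W x, synthetic inputs confined to a low-dimensional subspace, and a plain singular-value threshold in place of the paper's accuracy-based rank search. The names `column_space_projector`, `P_in`, `P_out`, `W_up`, and `W_down` are hypothetical, and which of the paper's $P_S$ and $P_T$ acts on the input or output side depends on a convention not stated in this summary.

```python
import numpy as np

def column_space_projector(acts, rel_tol=1e-8):
    """Return (P, k): the orthogonal projector onto the span of the columns
    of `acts`, and the numerical dimension k of that span.
    The relative threshold is an assumption, not the paper's rank search."""
    U, s, _ = np.linalg.svd(acts, full_matrices=False)
    k = int(np.sum(s > rel_tol * s[0]))  # count non-negligible singular values
    U_k = U[:, :k]
    return U_k @ U_k.T, k

rng = np.random.default_rng(0)
d_in, d_out, n_samples, k_true = 64, 48, 500, 8

# Synthetic layer inputs that live in a k_true-dimensional subspace (assumption).
basis = rng.standard_normal((d_in, k_true))
X = basis @ rng.standard_normal((k_true, n_samples))  # shape (d_in, n_samples)

W = rng.standard_normal((d_out, d_in))  # dense weight, full rank on its own
Y = W @ X                               # layer outputs on the data

# Projectors onto the subspaces actually occupied by the inputs and outputs.
P_in, k_in = column_space_projector(X)
P_out, k_out = column_space_projector(Y)

# Projected weight: same outputs on this data, but rank at most min(k_in, k_out).
W_proj = P_out @ W @ P_in
r = min(k_in, k_out)

# Rank-r factorization of the projected weight via truncated SVD.
U, s, Vt = np.linalg.svd(W_proj, full_matrices=False)
W_up = U[:, :r] * s[:r]   # shape (d_out, r)
W_down = Vt[:r]           # shape (r, d_in)

print("rank of W:", np.linalg.matrix_rank(W))
print("rank of projected W:", np.linalg.matrix_rank(W_proj, tol=1e-8))
print("outputs preserved:", np.allclose(W_up @ (W_down @ X), Y))
print("parameters:", W.size, "->", W_up.size + W_down.size)
```

In this toy setting the dense weight looks full rank in isolation, while the projected weight has rank r and reproduces the layer's outputs on the data, so two thin factors can replace the original matrix. On real activations the subspaces are only approximately low-dimensional, which is why a tolerance on the accuracy drop and fine-tuning come into play.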

Paper Structure (22 sections, 9 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Experiments on different layers of VGG11, CIFAR10
  • Figure 2: Utilization snapshots of different dataset-network pairs. The rank of the unaltered $W$ is plotted for each layer as the dotted line, and the rank of the transformed $W$ as the solid line. The brackets list the parameters and accuracy of the original network and of the decomposed, finetuned network. FT refers to finetuning from SWAG, in a linear or end-to-end fashion.
  • Figure 3: Visualizing the change in accuracy, number of parameters and FLOPs (size of bubble) of the decomposed, finetuned model. $\epsilon$ is the accuracy drop tolerance per layer during rank search.