The Universal Weight Subspace Hypothesis

Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille

TL;DR

Problem: do deep networks converge to the same low-dimensional weight subspaces across tasks and architectures? Approach: large-scale spectral analysis of over 1100 models, including LoRA adapters and full weights, combined with a Hilbert-space formulation and a convergence framework. Findings: most weight variation lies in a small set of directions (roughly 16) across CNNs, ViTs, LLaMA, GPT-2, and diffusion adapters, enabling accurate reconstruction and efficient adaptation by learning only a few coefficients; implications include parameter-efficient fine-tuning, scalable model merging, and reduced memory and energy use in deployment. Significance: reveals a fundamental geometric regularity in neural weight spaces and offers practical paths to efficient, reusable, and more environmentally friendly AI systems.

Abstract

We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models (including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models), we identify universal subspaces that capture the majority of the variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training- and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
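
The kind of spectral analysis the abstract describes can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic weights, the noise level, the 90% threshold, and the helper names `explained_variance` and `smallest_rank` are all assumptions made for illustration. Each row stands in for one model's flattened weights, and the singular values of the centered stack show how quickly the variance concentrates in a few directions.

```python
import numpy as np


def explained_variance(weights: np.ndarray) -> np.ndarray:
    """Cumulative fraction of variance captured by the leading principal directions.

    weights has shape (n_models, n_params): one flattened weight tensor per row.
    """
    centered = weights - weights.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()


def smallest_rank(cumulative: np.ndarray, threshold: float = 0.9) -> int:
    """Smallest number of directions whose cumulative variance reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic stand-in for a model collection (assumption, not real checkpoints):
# every "model" is a random mixture of the same few directions plus small noise.
rng = np.random.default_rng(0)
n_models, n_params, true_rank = 50, 256, 4
shared_directions = rng.standard_normal((true_rank, n_params))
mixing = rng.standard_normal((n_models, true_rank))
weights = mixing @ shared_directions + 0.01 * rng.standard_normal((n_models, n_params))

cumulative = explained_variance(weights)
k = smallest_rank(cumulative, threshold=0.9)
print(f"directions needed for 90% of the variance: {k}")
```

On real checkpoints the rows would come from the same layer of models sharing an architecture, which matches the abstract's "within shared architectures" qualifier.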

Paper Structure

This paper contains 28 sections, 5 theorems, 26 equations, 13 figures, 8 tables, 1 algorithm.

Key Result

Theorem 2.5

Assume ass:realizability--ass:pertask. Let $c_1, c_2$ be any absolute constants. For any $\delta\in(0,1)$, choose $\delta_t=\delta/(2T)$ and set $\delta_T=\delta/2$. With probability at least $1-\delta$ (over tasks and all per-task samples), If moreover $\gamma_k>0$, then where $\bar{\eta} = \frac{1}{T}\sum^{T}_{t=1}\eta_t$, $\overline{\eta^2_t} = \frac{1}{T}\sum^{T}_{t=1}\eta^2_t$ and $\eta_t$

Figures (13)

  • Figure 1: Deep Networks Converge to Shared, Low-Rank (Universal) Subspaces. Across distinct architectures and modalities, neural networks systematically learn to operate within remarkably similar low-dimensional parameter subspaces. Left: Principal component analysis of 200 GPT-2, 500 Vision Transformer, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay, strong evidence that a small number of weight directions capture the dominant variance despite vast differences in training data, objectives, and initialization. The black baseline (independent subspaces reference) represents the naive expectation that models would learn distinct directions; our empirical findings contradict this. Right: Strikingly, 500 randomly initialized ViT models converge to a common low-rank subspace, demonstrating that this is a fundamental neural network property. This emergent structure unlocks powerful applications: parameter-efficient adaptation, efficient model merging, compressed storage, and accelerated training and inference. Further discussion appears in the appendix.
  • Figure 2: Proving existence of universal subspaces in CNNs. Decomposing 5 ResNet50 models trained on different tasks shows the emergence of a low-rank universal subspace where the majority of the information is present in only 16 (or fewer) distinct subspace directions for all layers of the network.
  • Figure 3: Proving existence of universal subspaces in deep networks. Decomposing 500 sets of LoRAs trained on different tasks using the Mistral-7B model shows the emergence of a low-rank universal subspace where the majority of the information is present in only 16 (or fewer) distinct subspace directions for all layers of the network. Plots of other layers are in the appendix.
  • Figure 4: Lots of LoRAs Model Size vs Performance plot.
  • Figure 5: Text-to-Image Generation Results for Individual models vs. our Universal Subspace model. We notice no visual reduction in style quality despite significant reduction in total model size.
  • ...and 8 more figures
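
The storage argument behind these figures (a shared basis plus a few coefficients per model, instead of one full weight set per model) can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the paper's procedure: the weights are synthetic and exactly rank 16 by construction, and `fit_shared_basis`, `encode`, and `decode` are hypothetical helper names.

```python
import numpy as np


def fit_shared_basis(weights: np.ndarray, k: int):
    """Return the mean weight vector and the top-k principal directions (rows)."""
    mean = weights.mean(axis=0)
    _, _, vt = np.linalg.svd(weights - mean, full_matrices=False)
    return mean, vt[:k]


def encode(w: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project weights onto the shared basis, giving k coefficients per model."""
    return (w - mean) @ basis.T


def decode(coeffs: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Rebuild full weights from the coefficients and the shared basis."""
    return mean + coeffs @ basis


# Synthetic collection (assumption): 40 "models" that lie exactly in a
# 16-dimensional subspace of a 512-parameter space.
rng = np.random.default_rng(1)
n_models, n_params, k = 40, 512, 16
shared = rng.standard_normal((k, n_params))
weights = rng.standard_normal((n_models, k)) @ shared

mean, basis = fit_shared_basis(weights, k)
coeffs = encode(weights, mean, basis)
recon = decode(coeffs, mean, basis)
rel_err = np.linalg.norm(recon - weights) / np.linalg.norm(weights)

dense_count = n_models * n_params                     # one full weight set per model
compact_count = basis.size + mean.size + coeffs.size  # shared basis + mean + coefficients
print(f"relative reconstruction error: {rel_err:.2e}")
print(f"stored numbers: {dense_count} dense vs {compact_count} compact")
```

The basis and mean are stored once, so each additional model in this sketch costs only `k` extra numbers. Real weights are only approximately low-rank, so the reconstruction error would not vanish as it does for this constructed example.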

Theorems & Definitions (12)

  • Definition 2.1: Task second-moment operator
  • Remark 2.2
  • Theorem 2.5: Two-level convergence to the shared subspace
  • Lemma A.1: Matrix Bernstein for self-adjoint operators
  • proof
  • Lemma A.2: Davis-Kahan, sin-$\Theta$
  • Theorem A.3: Restating Two-level convergence to the shared subspace theorem
  • proof: Proof of Theorem 2.5
  • Definition A.4: Population projection risk
  • Corollary A.5: Excess projection risk of the learned subspace
  • ...and 2 more
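
The sin-$\Theta$ distance named in Lemma A.2 measures how far apart two subspaces are, and it can be computed numerically from principal angles. The sketch below is a generic NumPy implementation of that standard quantity, not code from the paper; the function name `sin_theta_distance`, the dimensions, and the noise levels are assumptions chosen for illustration.

```python
import numpy as np


def sin_theta_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Frobenius norm of sin(Theta) between the column spans of a and b.

    a and b both have shape (d, k) with linearly independent columns.
    """
    qa, _ = np.linalg.qr(a)  # orthonormal basis for span(a)
    qb, _ = np.linalg.qr(b)  # orthonormal basis for span(b)
    # Singular values of qa^T qb are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(qa.T @ qb, compute_uv=False), 0.0, 1.0)
    return float(np.sqrt(np.sum(1.0 - cosines ** 2)))


# Illustration (assumption): perturb a reference basis with shrinking noise
# and report the distance between the perturbed and reference subspaces.
rng = np.random.default_rng(0)
true_basis = rng.standard_normal((64, 4))
for noise in (1.0, 0.1, 0.01):
    estimate = true_basis + noise * rng.standard_normal(true_basis.shape)
    print(f"noise={noise}: sin-theta distance = {sin_theta_distance(true_basis, estimate):.4f}")
```

The distance is zero when the two spans coincide and reaches $\sqrt{k}$ when they are mutually orthogonal, so it gives a scale-free way to check whether an estimated subspace is approaching a reference one.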