Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades

Chongjie Si, Jingjing Jiang, Wei Shen

TL;DR

The paper investigates why weights in large foundation models (LFMs) exhibit Gaussian-like distributions and how this informs adaptation and editing. Through broad empirical analysis of LFMs across natural language processing (NLP), computer vision (CV), and multimodal (MM) domains, the authors show that both pre-trained weights and transformation weights behave like Gaussian noise with i.i.d. elements. They derive a link between transformation weights and Gaussian perturbations, and hypothesize an underlying optimal weight $\mathbf{W}^*$ that is zero-mean, symmetric, and sparse, with its nonzero values consisting of a truncated-Gaussian component plus a few outliers. They validate these ideas with practical demonstrations in parameter-efficient fine-tuning (PEFT) and model merging, reporting performance gains when leveraging Gaussian-noise-based transformations and outlier amplification. Collectively, these findings offer a physics-inspired, foundational perspective intended to simplify AI research on LFMs and guide more efficient adaptation, editing, and compression techniques. The work points toward a principled framework for evaluating and exploiting weight distributions, potentially enabling scalable, robust LFMs and informing future theoretical developments.

Abstract

This paper presents a pioneering exploration of the mechanisms underlying large foundation models' (LFMs) weights, aiming to simplify AI research. Through extensive observation and analysis of prevailing LFMs, we find that, regardless of initialization strategy, their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns. We further discover that the weights share the i.i.d. properties of Gaussian noise, and we explore the direct relationship between the two. We find that transformation weights can be derived from Gaussian noise, and that they primarily serve to increase the standard deviation of pre-trained weights, with their own standard deviation growing with layer depth. In other words, transformation weights broaden the acceptable deviation from the optimal weights, facilitating adaptation to downstream tasks. Building on these conclusions, we thoroughly discuss the nature of optimal weights, ultimately concluding that they should be zero-mean, symmetric, and sparse, with the sparse values following a truncated Gaussian distribution plus a few outliers. Our experiments in LFM adaptation and editing demonstrate the effectiveness of these insights. We hope these findings can provide a foundational understanding to pave the way for future advancements in the LFM community.
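The claim that transformation weights "increase the standard deviation of pre-trained weights" can be illustrated with a minimal NumPy sketch. It is not the authors' procedure: the matrices are synthetic, the scales `sigma_w` and `sigma_delta` are hypothetical, and the transformation weights are assumed to be independent of the pre-trained weights. Under that assumption the variances add, so the adapted weights should have a standard deviation close to $\sqrt{\sigma_w^2 + \sigma_\Delta^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scales, chosen only for illustration (not taken from the paper).
sigma_w, sigma_delta = 0.02, 0.005

# Synthetic stand-ins: zero-mean Gaussian "pre-trained" and "transformation" weights.
W = rng.normal(0.0, sigma_w, size=(1024, 1024))
delta_W = rng.normal(0.0, sigma_delta, size=W.shape)

# Additive adaptation: adapted weights = pre-trained weights + transformation weights.
W_adapted = W + delta_W

std_before = W.std()
std_after = W_adapted.std()

# For independent zero-mean terms the variances add.
std_predicted = np.sqrt(sigma_w**2 + sigma_delta**2)

print(f"std before: {std_before:.5f}")
print(f"std after:  {std_after:.5f}")
print(f"predicted:  {std_predicted:.5f}")
```

If the transformation weights were correlated with the pre-trained weights, the variances would no longer simply add, so this sketch only covers the independent-noise case.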

Paper Structure

This paper contains 39 sections, 13 equations, 26 figures, 11 tables, 2 algorithms.

Figures (26)

  • Figure 1: The distribution of pre-trained weights of prevailing large foundation models across NLP, CV, and MM, shown for several randomly selected layers and modules. The distribution of these weights exhibits a remarkable resemblance to a Gaussian distribution. Weight distribution plots for each layer are provided in the Appendix for a more comprehensive visualization.
  • Figure 2: The distribution of the transformation matrices with different forms learned by two adaptation methods when fine-tuning LLaMA-7B on commonsense reasoning tasks, including different settings and initialization strategies. Regardless of the settings, initializations, or computation methods, the transformation weights closely resemble a Gaussian distribution. The distributions are shown for several randomly selected layers and modules; the weight distributions for each layer are shown in the Appendix.
  • Figure 3: The distribution of the elements if they are independent but not identically distributed. The subfigures represent the overall distributions derived under the assumption that each element follows a different Gaussian distribution.
  • Figure 4: Weight distribution of ConvNeXt-xlarge, Stage 2, Layers 0-17, after applying the 3$\sigma$ filter and the extremely-small-value filter.
  • Figure 5: The relationship between $\sigma$ difference and performance gap.
  • ...and 21 more figures
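
The idea behind Figure 3 (elements that are independent but not identically distributed) can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the figure: each element is given its own zero-mean Gaussian with a standard deviation drawn from a hypothetical uniform range, and `excess_kurtosis` is a helper defined here. Pooling such elements gives a heavier-tailed histogram than a single Gaussian, which shows up as positive excess kurtosis, whereas i.i.d. Gaussian elements give a value near zero.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; approximately 0 for Gaussian data."""
    x = np.asarray(x, dtype=np.float64).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(0)
n = 200_000

# Case 1: i.i.d. elements, all drawn from the same Gaussian.
iid = rng.normal(0.0, 0.02, size=n)

# Case 2: independent elements, each with its own standard deviation.
# The range [0.005, 0.05] is a hypothetical choice for illustration.
sigmas = rng.uniform(0.005, 0.05, size=n)
non_identical = rng.normal(0.0, sigmas)

k_iid = excess_kurtosis(iid)
k_mix = excess_kurtosis(non_identical)

print(f"excess kurtosis, i.i.d.:            {k_iid:.3f}")
print(f"excess kurtosis, non-identical std: {k_mix:.3f}")
```

Only the standard deviation varies per element here; letting the means differ as well would distort the pooled histogram in other ways, such as skew or multiple modes.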