It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs

Jun Wu, Patrick Huang, Jiangtao Wen, Yuxing Han

TL;DR

This work empirically shows that weights, activations, and gradients in LLMs are well modeled by generalized Gaussian distributions, and introduces a unified, end-to-end optimization framework grounded in this observation.

Abstract

Despite rapid progress in large language models (LLMs), the statistical structure of their weights, activations, and gradients, along with its implications for initialization, training dynamics, and efficiency, remains largely unexplored. We empirically show that these quantities in LLMs are well modeled by generalized Gaussian (GG) distributions, and introduce a unified, end-to-end optimization framework grounded in this observation. Our contributions are threefold: (1) a GG-based initialization that aligns with trained model statistics, accelerating convergence and improving accuracy; (2) ACT, a progressive activation-constrained training method that reduces redundancy and propagation overhead; and (3) GCT, a gradient-constrained training algorithm that substantially lowers communication cost in distributed training. Experiments across diverse architectures demonstrate consistently smaller, faster models with minimal communication overhead that match or surpass standard baselines. By anchoring LLM optimization in principled statistical modeling, this work advances efficient, scalable, and hardware-aware AI systems.
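
To make the central modeling claim concrete, the sketch below draws samples from a zero-mean generalized Gaussian and recovers its shape parameter from them. It assumes the common parameterization p(x) = β / (2αΓ(1/β)) · exp(−(|x|/α)^β), where β = 2 gives a Gaussian and β = 1 a Laplacian; the symbols α and β, the function names, and the moment-matching estimator are illustrative choices, not necessarily the notation or fitting procedure used in the paper.

```python
import math
import random


def sample_gg(n, beta, alpha=1.0, seed=0):
    """Draw n zero-mean generalized Gaussian samples (shape beta, scale alpha)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # If G ~ Gamma(1/beta, 1), then alpha * G**(1/beta) is distributed as |X|.
        magnitude = alpha * rng.gammavariate(1.0 / beta, 1.0) ** (1.0 / beta)
        samples.append(magnitude if rng.random() < 0.5 else -magnitude)
    return samples


def moment_ratio(beta):
    """E[x^2] / (E|x|)^2 for a zero-mean GG: Gamma(1/b) Gamma(3/b) / Gamma(2/b)^2."""
    return math.exp(
        math.lgamma(1.0 / beta) + math.lgamma(3.0 / beta) - 2.0 * math.lgamma(2.0 / beta)
    )


def estimate_shape(samples, lo=0.2, hi=10.0, iters=60):
    """Estimate the shape parameter by matching the empirical moment ratio."""
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    target = m2 / (m1 * m1)
    # moment_ratio decreases monotonically in beta, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_ratio(mid) > target:
            lo = mid  # the candidate tail is heavier than the data: raise beta
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    laplace_like = sample_gg(20000, beta=1.0)
    print("estimated shape:", estimate_shape(laplace_like))
```

An estimated shape well below 2 on real weights, activations, or gradients would indicate heavier tails than a Gaussian, which is the kind of mismatch the GG model is meant to capture.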

Paper Structure

This paper contains 18 sections, 14 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: Statistical distributions of weights, activations, and gradients in full-parameter fine-tuning of certain models, each fitted with a generalized Gaussian distribution (GGD) and a Gaussian distribution (GD). The statistics for weights and gradients are computed over all trainable parameters, while activations are collected from the output of each Transformer block during forward propagation. GGD fits better than GD in all cases.
  • Figure 2: Overview of the training process with GG Init, ACT, and GCT. At initialization, GG Init gives the model a low-entropy initial distribution. As training progresses, ACT compresses the activations generated during forward propagation, and GCT compresses the gradients transmitted between models.
  • Figure 3: Trade-off between model accuracy and compression rate for GG Init, ACT, and GCT under various hyperparameters: the shape parameter is varied for GG Init, while ACT and GCT use different Lagrange multipliers.
  • Figure 4: Variation of loss and average EG code length for ACT and GCT, compared with conventional training.
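
The "average EG code length" in Figure 4 can be illustrated with a small sketch. It assumes that "EG" denotes exponential-Golomb coding (the caption does not spell it out) and that values are uniformly quantized to integers before coding. The quantization step, the zigzag sign mapping, and all function names are hypothetical choices for illustration, not details taken from the paper.

```python
def exp_golomb_length(n):
    """Bits in the order-0 exp-Golomb codeword of a non-negative integer n."""
    # The codeword is (k - 1) zeros followed by the k-bit binary form of n + 1.
    return 2 * (n + 1).bit_length() - 1


def zigzag(v):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * v if v >= 0 else -2 * v - 1


def average_code_length(values, step):
    """Mean exp-Golomb bits per value after uniform quantization with `step`."""
    total = sum(exp_golomb_length(zigzag(round(v / step))) for v in values)
    return total / len(values)


if __name__ == "__main__":
    # Values concentrated near zero cost fewer bits on average.
    print(average_code_length([0.0, 0.0, 1.0, -1.0], step=1.0))
```

Under these assumptions, a distribution that is more sharply peaked around zero yields a shorter average codeword, which is why a lower-entropy tensor is cheaper to store or transmit.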