It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs

Jun Wu; Patrick Huang; Jiangtao Wen; Yuxing Han

TL;DR

This work empirically shows that weights, activations, and gradients in LLMs are well modeled by generalized Gaussian distributions, and introduces a unified, end-to-end optimization framework grounded in this observation.

Abstract

Despite rapid progress in large language models (LLMs), the statistical structure of their weights, activations, and gradients, together with its implications for initialization, training dynamics, and efficiency, remains largely unexplored. We empirically show that these quantities in LLMs are well modeled by generalized Gaussian (GG) distributions, and introduce a unified, end-to-end optimization framework grounded in this observation. Our contributions are threefold: (1) a GG-based initialization that aligns with trained model statistics, accelerating convergence and improving accuracy; (2) ACT, a progressive activation-constrained training method that reduces redundancy and propagation overhead; and (3) GCT, a gradient-constrained training algorithm that substantially lowers communication cost in distributed training. Experiments across diverse architectures yield consistently smaller, faster models with minimal communication overhead that match or surpass standard baselines. By anchoring LLM optimization in principled statistical modeling, this work advances efficient, scalable, and hardware-aware AI systems.
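
To make the idea of a GG-based initialization concrete, here is a minimal sketch of drawing weights from a zero-mean generalized Gaussian with density proportional to exp(-(|x|/alpha)^beta). It is not the authors' method: the abstract does not specify their scheme, so the function names, the shape value beta = 1.5, and the Glorot-style target variance are all illustrative assumptions. The sampler uses the standard fact that (|x|/alpha)^beta follows a Gamma(1/beta, 1) distribution.

```python
import math
import random


def gg_scale_for_std(std, beta):
    """Return the GG scale alpha that gives a zero-mean GG the requested std.

    Uses Var[x] = alpha^2 * Gamma(3/beta) / Gamma(1/beta).
    """
    return std * math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))


def sample_gg(n, alpha, beta, rng):
    """Draw n zero-mean GG samples via |x| = alpha * G**(1/beta), G ~ Gamma(1/beta, 1)."""
    samples = []
    for _ in range(n):
        g = rng.gammavariate(1.0 / beta, 1.0)        # shape 1/beta, scale 1
        sign = 1.0 if rng.random() < 0.5 else -1.0   # symmetric around zero
        samples.append(sign * alpha * g ** (1.0 / beta))
    return samples


def init_weights(fan_in, fan_out, beta=1.5, seed=0):
    """Hypothetical GG initializer: fan_out rows of fan_in weights each.

    The target variance 2 / (fan_in + fan_out) is a Glorot-style choice made
    for this sketch only; beta=1.5 is likewise an arbitrary illustrative shape.
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    alpha = gg_scale_for_std(std, beta)
    flat = sample_gg(fan_in * fan_out, alpha, beta, rng)
    return [flat[i * fan_in:(i + 1) * fan_in] for i in range(fan_out)]
```

Setting beta = 2 recovers an ordinary Gaussian and beta = 1 a Laplace distribution, so the shape parameter interpolates between familiar initializers; in practice one would presumably estimate beta from trained-model statistics rather than fix it by hand.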
