Hidden Dynamics of Massive Activations in Transformer Training

Jorge Gallego-Feliciano; S. Aaron McClendon; Juan Morinelli; Stavros Zervoudakis; Antonios Saravanos

TL;DR

This work demonstrates that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins. It also develops a machine learning framework that predicts the parameters of the fitted emergence curve from architectural specifications alone.

Abstract

We present the first comprehensive analysis of how massive activations develop throughout transformer training, using the Pythia model family as our testbed, and we release our full dataset publicly to support further research. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows highly predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We also develop a machine learning framework that predicts these parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Overall, the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins. Code is available at https://github.com/Aimpoint-Digital/massive-activations-fork
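
To make the phrase "exponentially-modulated logarithmic function with five key parameters" concrete, here is a minimal Python sketch. The abstract does not state the closed form, so the expression below is an assumption chosen only to match that description: a logarithm of a rescaled training step, damped by an exponential and shifted by a constant. The parameter names (A, lam, gamma, t0, K) and both helper functions are hypothetical and are not taken from the released code.

```python
import math


def massive_activation_curve(t, A, lam, gamma, t0, K):
    """Assumed form: f(t) = A * exp(-lam * tau) * log(tau) + K, tau = gamma * t + t0.

    This is an illustrative guess at an exponentially-modulated logarithmic
    function with five parameters, not the paper's confirmed equation.
    """
    tau = gamma * t + t0  # rescaled and shifted training step
    if tau <= 0:
        raise ValueError("gamma * t + t0 must be positive for log(tau)")
    return A * math.exp(-lam * tau) * math.log(tau) + K


def peak_step(steps, **params):
    """Return the step in `steps` at which the assumed curve is largest."""
    return max(steps, key=lambda t: massive_activation_curve(t, **params))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    params = dict(A=1.0, lam=0.1, gamma=1.0, t0=1.0, K=0.0)
    print("peak at step", peak_step(range(100), **params))
    print("late-training value", massive_activation_curve(10_000, **params))
```

Under this assumed form, the exponential eventually dominates the logarithm when lam > 0, so the curve settles at K late in training. That is one way such a function could separate steady-state behavior from emergence timing and magnitude, although the mapping of parameters to those quantities is also an assumption here.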
