Hidden Dynamics of Massive Activations in Transformer Training

Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos

TL;DR

This work demonstrates that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins. It also develops a machine learning framework that predicts the parameters of the fitted emergence curve from architectural specifications alone.

Abstract

We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and release our full dataset publicly to support further research. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows highly predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. Additionally, we develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins. Code is available at https://github.com/Aimpoint-Digital/massive-activations-fork
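The abstract names an exponentially-modulated logarithmic function with five parameters but does not spell out its form here. As a rough illustration only, the sketch below assumes the form $A e^{-\lambda x}\log(x) + K$ with $x = \gamma t + t_0$. The function name `ma_ratio_curve` and the parameter names `A`, `lam`, `gamma`, `t0`, and `K` are assumptions made for this example, not taken from the text above.

```python
import math


def ma_ratio_curve(t, A, lam, gamma, t0, K):
    """Evaluate an ASSUMED five-parameter curve at training step t.

    Assumed form (not stated in the abstract):
        f(t) = A * exp(-lam * x) * log(x) + K,  with  x = gamma * t + t0

    A      scales the logarithmic growth term
    lam    sets the strength of the exponential modulation
    gamma  rescales the training-step axis
    t0     shifts the training-step axis
    K      is a constant offset
    """
    x = gamma * t + t0
    if x <= 0:
        raise ValueError("gamma * t + t0 must be positive so that log(x) is defined")
    return A * math.exp(-lam * x) * math.log(x) + K


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the shape of the curve.
    for step in (0, 1000, 10000, 143000):
        print(step, ma_ratio_curve(step, A=50.0, lam=0.001, gamma=0.01, t0=1.0, K=2.0))
```

Under this assumed form with `A > 0`, setting `lam = 0` gives pure logarithmic growth, while `lam > 0` makes the curve rise, peak, and then decay toward `K`. Fitting it to measured ratios would need a nonlinear least-squares routine, which is left out here.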

Paper Structure

This paper contains 2 sections, 19 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Example analysis of massive activation (MA) development in Pythia 1B. a) Top 3 activations and median for each layer before and after training, showing that MAs develop during training. b) Evolution of the top-three-to-median activation ratios during training for two example layers. c) The 5-parameter model fits the evolution of MAs with $R^2 > 0.99$ over all layers; two example fits are shown.
  • Figure 2: Transformer parameter count versus the top-activation-to-median ratio for each model at its final checkpoint.
  • Figure 3: Top activation magnitudes per layer in models Pythia-14M, Pythia-1.4B and Pythia-12B at revision step 0 and 143000, which correspond to the start and end of training. Pythia-14M reaches a top 1 to median ratio of 83, Pythia-1.4B reaches 2350, and Pythia-12B reaches 3200.
  • Figure 4: Evolution of the ratio of top activations to median (Equation \ref{eq:ratio}) during training for Pythia 1B. The curves are linear interpolations of 37 data points corresponding to different training checkpoints. Apart from the highest activation, which is the focus of our study, we also plot the ratios for the second and third highest activations for comparison. The x-axis shows the training step and the y-axis shows the ratio of the top magnitudes to the median activation.
  • Figure 5: Heatmaps showing the location and magnitude of peak MAs by layer depth and model size. Training ends at step 143k for the Pythia family, so the yellow middle layers in \ref{fig:peak-analysis-x} show that MAs would continue to rise monotonically if training continued, whereas the darker layers, generally the shallow and deep ones, peak and start decreasing before training ends.
  • ...and 5 more figures
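
Several captions above rely on the ratio of the top activations to the median activation. A minimal sketch of that quantity follows, assuming the ratio is taken over absolute activation magnitudes within a single layer. The helper name `top_to_median_ratios` and the toy input are hypothetical, and the paper's exact definition (Equation \ref{eq:ratio}) is not reproduced in this excerpt.

```python
from statistics import median


def top_to_median_ratios(activations, k=3):
    """Return the k largest |activation| values divided by the median |activation|.

    Assumption: magnitudes are absolute values, and the median is taken over
    the same flat list of activations for one layer.
    """
    magnitudes = sorted((abs(a) for a in activations), reverse=True)
    if not magnitudes:
        raise ValueError("activations must not be empty")
    med = median(magnitudes)
    if med == 0:
        raise ValueError("median magnitude is zero, so the ratio is undefined")
    return [m / med for m in magnitudes[:k]]


if __name__ == "__main__":
    # Toy layer output with one outlier standing in for a massive activation.
    layer = [0.5, -1.0, 2.0, -100.0, 1.0]
    print(top_to_median_ratios(layer))
```

On the toy input, the first ratio is far larger than the second and third. This is the kind of gap the figures track across checkpoints.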