The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models

Carlo Nicolini, Jacopo Staiano, Bruno Lepri, Raffaele Marino

TL;DR

Tracking how the statistical distribution of model parameters evolves during training, and specifically its bifurcation effects, can help in understanding model quality, potentially reducing training costs and evaluation effort, and empirically showing why weight sparsification is effective.

Abstract

A substantial gap persists in understanding the reasons behind the exceptional performance of the Transformer architecture in NLP. A particularly unexplored area is the mechanistic description of how the distribution of parameters evolves over time during training. In this work, we suggest that examining the time evolution of the statistical distribution of model parameters, and specifically bifurcation effects, can help in understanding model quality, potentially reducing training costs and evaluation effort, and empirically showing the reasons behind the effectiveness of weight sparsification.

Paper Structure (13 sections, 5 equations, 7 figures, 1 table)


Figures (7)

  • Figure 1: Panel A shows the basic architecture of the Pythia models. The unembedding layer $\mathbf{W}_U$ is the last green layer. The attention matrix $\mathbf{A}$ is contained in the red multi-head self-attention layer. Panel B shows the output embedding matrix at the first and last training steps for the 14M model; for illustration purposes, only the first 512 of the 50304 columns are shown. Panel C shows the first-layer attention matrix at the first and last training steps for the 1B model. Panel D shows the average of the token logits of a long sentence for the 1B model at both the first and last training steps.
  • Figure 2: Dynamics of the density of the unembedding layer for four models. Top row: models trained on the NDD (non-deduped) dataset. Bottom row: models trained on the DD (deduped) dataset.
  • Figure 3: Mean squared displacement (MSD) of the unembedding layer weights as a function of training steps. Left: the smallest models, 70M and 160M, on the deduped dataset. Right: the smallest models, 14M and 31M, on the non-deduped dataset. Vertical dashed lines mark the peak of $\mathrm{MSD}(\tau)$.
  • Figure 4: Perplexity of generated tokens on the first 500 examples of the Lambada test set. Left: perplexity on a linear scale. Right: the same perplexity on a logarithmic scale. A horizontal black line is drawn at zero perplexity in both plots.
  • Figure 5: Causal unmasking process. The tokenized sentence is used to generate six new sentences: starting from an initial set of tokens, the model completes each one up to the number of tokens in the original sentence. For each sub-sentence, the newly generated tokens are depicted in gray.
  • ...and 2 more figures
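
The two quantities named in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. The paper's exact definitions are not reproduced in this summary, so the code below assumes one common convention: the mean squared displacement is the average squared difference between a weight snapshot and a reference snapshot (here, the first checkpoint), and perplexity is the exponential of the mean negative log-probability of the tokens. The function names and the toy checkpoint values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def mean_squared_displacement(w_ref: Sequence[float], w_now: Sequence[float]) -> float:
    """Average of (w_now[i] - w_ref[i])**2 over all weights.

    Assumption: displacement is measured against a fixed reference
    snapshot; the paper may use a different (e.g. lag-based) definition.
    """
    if len(w_ref) != len(w_now):
        raise ValueError("snapshots must have the same number of weights")
    return sum((b - a) ** 2 for a, b in zip(w_ref, w_now)) / len(w_ref)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability of the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy flattened "unembedding weights" at three hypothetical checkpoints.
checkpoints = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, -0.1, 0.2, 0.0],
    [0.3, -0.2, 0.1, 0.4],
]

# MSD of each checkpoint relative to the first one.
msd_curve = [mean_squared_displacement(checkpoints[0], w) for w in checkpoints]

# A model that assigns probability 0.25 to every token has perplexity close to 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

With real checkpoints, the flattened weights of the layer of interest at each saved training step would replace the toy lists, and the resulting curve could be plotted against the step index.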