Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings

Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira

TL;DR

This work proposes log-likelihood vectors as a unified coordinate system (model map) for comparing language models across diverse settings, including pretraining, seeds, model sizes, quantization, fine-tuning, and layer depth. It establishes an interpretable KL-divergence scale in bits per byte and shows that log-likelihood trajectories are substantially more stable than weight trajectories, exhibiting subdiffusive dynamics and early stabilization even as weights drift. Quantitative analyses across Pythia pretraining, 8-bit and 4-bit quantization, fine-tuning, and layer-wise studies reveal characteristic KL scales and structured perturbations, enabling meaningful cross-model comparisons. The findings suggest a folding mechanism in the mapping from weight space to log-likelihood space. They also provide a practical framework for trajectory analysis and model comparison, with potential implications for understanding learning dynamics and robustness in large language models.

Abstract

Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
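As a rough illustration of how a KL divergence in bits per byte might be read off log-likelihood vectors, the sketch below assumes the approximation KL ≈ ||q_i − q_j||² / (2N), where q_i is the double-centered log-likelihood vector of model i over N texts. The abstract does not state this estimator, so the formula, the function names, and the per-byte normalization by a hypothetical `mean_bytes_per_text` are all assumptions made for illustration.

```python
import numpy as np

def double_center(loglik):
    """Row- and column-center a (models x texts) matrix of log-likelihoods (nats)."""
    L = np.asarray(loglik, dtype=float)
    row_mean = L.mean(axis=1, keepdims=True)   # mean over texts, per model
    col_mean = L.mean(axis=0, keepdims=True)   # mean over models, per text
    return L - row_mean - col_mean + L.mean()

def kl_bits_per_byte(loglik, i, j, mean_bytes_per_text):
    """Approximate KL between models i and j, in bits per byte.

    ASSUMPTION: KL ~= ||q_i - q_j||^2 / (2N) with double-centered vectors q.
    ASSUMPTION: dividing by the mean text length in bytes gives a per-byte value.
    """
    Q = double_center(loglik)
    n_texts = Q.shape[1]
    kl_nats_per_text = np.sum((Q[i] - Q[j]) ** 2) / (2.0 * n_texts)
    kl_bits_per_text = kl_nats_per_text / np.log(2.0)  # nats -> bits
    return kl_bits_per_text / mean_bytes_per_text

# Synthetic example: 3 hypothetical models scored on 4 hypothetical texts.
example_loglik = np.array([
    [-100.0, -220.0, -310.0, -150.0],
    [-100.0, -220.0, -310.0, -150.0],  # identical to model 0
    [-104.0, -215.0, -318.0, -149.0],
])
kl_same = kl_bits_per_byte(example_loglik, 0, 1, mean_bytes_per_text=1000.0)
kl_diff = kl_bits_per_byte(example_loglik, 0, 2, mean_bytes_per_text=1000.0)
```

Under these assumptions the estimate is symmetric in the two models and is zero for identical log-likelihood vectors, which is what makes it usable as a squared distance on the model map.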

Paper Structure

This paper contains 50 sections, 9 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Visualization of the pretraining trajectories of Pythia models across five model sizes and seven random seeds using t-SNE (perplexity=30) applied to log-likelihood vectors. Each point represents a model checkpoint, with lighter colors indicating earlier steps. The thickness of the connecting lines reflects the magnitude of KL divergence between successive checkpoints. See Appendix \ref{app:visual_pretraining} for additional visual analyses.
  • Figure 2: (a) t-SNE visualization (perplexity=30) of the original models and their 8-bit quantized models. (b) t-SNE visualization (perplexity=20) of the base models and their fine-tuned models. Lines represent parent-child relationships. Generation refers to the depth from the root model within each lineage. (c) t-SNE visualization (perplexity=30) of the trajectories across layers for 50 models, obtained by applying the logit lens to each layer. Deeper layers are represented with lighter colors, and the final layer is indicated by a white square.
  • Figure 3: Scale of KL divergence (bits per byte) across various settings. The first four entries from the left correspond to pretraining-related comparisons, showing the KL divergence between consecutive saved checkpoints at 1k-step intervals during the last 50k steps of training, at 1k-step intervals during the first 50k steps, across different random seeds at the final checkpoint, and across different model sizes. The subsequent entries show the KL divergence before and after quantization (8-bit and 4-bit), before and after fine-tuning, between randomly selected pairs within the same model type, between randomly selected pairs across all model types, and between adjacent layers. The vertical axis is shown on a symlog scale. See Appendix \ref{app:kl_table} and Table \ref{tab:kl} for more details.
  • Figure 4: KL divergence between consecutive saved checkpoints of Pythia during pretraining. (Left) Warmup phase with non-uniform checkpoint spacing. (Right) Post-warmup phase with checkpoints saved every 1k training steps.
  • Figure 5: (Left) Temporal evolution of the squared Euclidean distance in the weights and log-likelihood vectors from steps 120k to 130k for Pythia. (Right) Diffusion exponent as a function of the starting point $t_0$. The exponent is estimated using the least squares method.
  • ...and 14 more figures
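
The Figure 5 caption mentions a diffusion exponent estimated by least squares. A minimal sketch of one common way to do this follows, assuming the squared distance from a starting checkpoint grows as (t − t_0)^α, so that α is the slope of a straight-line fit in log-log coordinates. The function name, the array layout, and the use of `np.polyfit` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def diffusion_exponent(steps, states, t0_index=0):
    """Estimate alpha in ||x(t) - x(t0)||^2 ~ C * (t - t0)^alpha.

    steps:  1-D array of training steps, strictly increasing.
    states: 2-D array (checkpoints x dims), e.g. weights or log-likelihood vectors.
    ASSUMPTION: a power law holds over the window after t0, and every later
    checkpoint differs from the starting one (so the logarithm is defined).
    """
    steps = np.asarray(steps, dtype=float)
    states = np.asarray(states, dtype=float)
    dt = steps[t0_index + 1:] - steps[t0_index]
    sq_dist = np.sum((states[t0_index + 1:] - states[t0_index]) ** 2, axis=1)
    # Ordinary least squares on log(sq_dist) = alpha * log(dt) + log(C).
    alpha, _log_c = np.polyfit(np.log(dt), np.log(sq_dist), 1)
    return alpha

# Synthetic trajectory whose squared displacement grows like t^0.5.
demo_steps = np.arange(0, 11) * 1000.0
demo_states = (demo_steps ** 0.25)[:, None]
demo_alpha = diffusion_exponent(demo_steps, demo_states)
```

An exponent of 1 corresponds to ordinary diffusion (a random walk), while values below 1 indicate subdiffusion, the regime the abstract reports for learning trajectories.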