Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings
Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira
TL;DR
This work uses log-likelihood vectors as a shared coordinate system (a "model map") for comparing language models across pretraining checkpoints, random seeds, model sizes, quantization, fine-tuning, and layer depth. It establishes an interpretable KL-divergence scale in bits per byte and shows that trajectories in log-likelihood space are far more stable than trajectories in weight space: they are subdiffusive and stabilize early even as the weights continue to drift. Quantitative analyses of Pythia pretraining, 8-bit and 4-bit quantization, fine-tuning, and layer-wise behavior reveal characteristic KL scales and structured perturbations in each setting, which makes cross-model comparisons meaningful. The findings suggest a folding mechanism in the mapping from weight space to log-likelihood space. The framework offers a practical basis for trajectory analysis and model comparison, with possible implications for understanding learning dynamics and robustness in large language models.
Abstract
Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
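
To make the log-likelihood-vector view concrete, the following is a minimal sketch and not the authors' implementation. It assumes, as a modeling choice not stated in this section, that the KL divergence between two models is approximated by the squared Euclidean distance between their double-centered log-likelihood vectors divided by 2N, where N is the number of evaluation texts. The function names, the conversion to bits per byte via the mean text length, and the toy numbers are all hypothetical.

```python
import numpy as np


def double_center(loglik):
    """Double-center a (K models x N texts) matrix of log-likelihoods in nats."""
    loglik = np.asarray(loglik, dtype=float)
    row_mean = loglik.mean(axis=1, keepdims=True)  # per-model mean
    col_mean = loglik.mean(axis=0, keepdims=True)  # per-text mean
    return loglik - row_mean - col_mean + loglik.mean()


def approx_kl_bits_per_byte(loglik, bytes_per_text):
    """Pairwise KL estimates between models, expressed in bits per byte.

    Assumption (not taken from this section): KL is approximated by
    ||q_i - q_j||^2 / (2N) in nats per text, where q_i is the
    double-centered log-likelihood vector of model i.
    """
    q = double_center(loglik)
    n_texts = q.shape[1]
    diff = q[:, None, :] - q[None, :, :]        # shape (K, K, N)
    sq_dist = (diff ** 2).sum(axis=-1)          # squared Euclidean distances
    kl_nats_per_text = sq_dist / (2.0 * n_texts)
    # Convert nats per text to bits per byte using the mean text length.
    mean_bytes = float(np.mean(bytes_per_text))
    return kl_nats_per_text / (np.log(2.0) * mean_bytes)


# Toy example: 2 hypothetical models scored on 2 texts of 1 byte each.
toy_loglik = np.array([[0.0, 0.0],
                       [2.0, 0.0]])
toy_kl = approx_kl_bits_per_byte(toy_loglik, bytes_per_text=[1, 1])
```

In this sketch, double-centering removes per-model and per-text offsets, so adding a constant to every log-likelihood of one model leaves the estimated divergences unchanged. Dividing by the mean byte length is one simple way to put texts of different lengths on a common scale.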
