Dynamics of Transient Structure in In-Context Linear Regression Transformers
Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, Daniel Murfet
TL;DR
The paper investigates transient structure in transformers trained on in-context linear regression across varying task diversity $M$, showing a progression from ridge-like generalization to memorization in the form of the discrete minimum mean squared error (dMMSE) predictor. It combines joint trajectory PCA, which reveals a generalization-memorization axis in function space, with a Bayesian-inspired loss/complexity tradeoff based on the local learning coefficient (LLC) to explain the dynamics. The authors replicate the transient ridge phenomenon, identify a task-diversity threshold, and provide LLC-based empirical evidence linking development to an evolving balance between accuracy and model complexity. These results offer a principled lens, dynamic internal model selection, for understanding how internal structure in deep networks develops and reorganizes during training, with implications for generalization and robustness.
Abstract
Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
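To make the contrast between the general and the specialized solution concrete, the following is a minimal NumPy sketch of the two reference predictors, not the authors' code. It assumes the standard in-context linear regression setup: inputs and tasks drawn from a standard Gaussian, labels $y = t^\top x + \varepsilon$ with Gaussian noise of known variance, and a training distribution that is uniform over $M$ fixed tasks. Under those assumptions, ridge regression is the posterior mean for the Gaussian task prior, while the specialized solution is the posterior mean for the discrete prior over the $M$ training tasks. The function names, dimensions, and noise level are illustrative choices, not values taken from the paper.

```python
import numpy as np


def ridge_estimate(X, y, noise_var):
    """Posterior-mean task estimate under a standard Gaussian task prior.

    X: (K, D) in-context inputs, y: (K,) in-context labels.
    """
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + noise_var * np.eye(D), X.T @ y)


def dmmse_estimate(X, y, tasks, noise_var):
    """Posterior-mean task estimate under a uniform prior over `tasks` (M, D).

    Each training task is weighted by its Gaussian likelihood on the context.
    """
    residuals = y[None, :] - tasks @ X.T              # (M, K)
    log_w = -0.5 * np.sum(residuals**2, axis=1) / noise_var
    log_w -= log_w.max()                              # numerical stability
    w = np.exp(log_w)
    w /= w.sum()
    return w @ tasks                                  # (D,)


def compare(seed=0, D=8, M=4, K=16, noise_var=0.125):
    """Squared error of each estimator on a seen task and on a fresh task.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    tasks = rng.standard_normal((M, D))               # finite training task set
    X = rng.standard_normal((K, D))
    noise = np.sqrt(noise_var) * rng.standard_normal(K)
    results = {}
    for name, t in [("seen", tasks[0]), ("fresh", rng.standard_normal(D))]:
        y = X @ t + noise
        results[name] = {
            "ridge": float(np.sum((ridge_estimate(X, y, noise_var) - t) ** 2)),
            "dmmse": float(np.sum((dmmse_estimate(X, y, tasks, noise_var) - t) ** 2)),
        }
    return results


if __name__ == "__main__":
    print(compare())
```

Because the dMMSE estimate is always a convex combination of the training tasks, it can only do well on tasks resembling those it has seen, whereas ridge makes no use of the training task set. This is the sense in which the two predictors mark the memorization and generalization ends of the axis described in the abstract.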
