Tending Towards Stability: Convergence Challenges in Small Language Models

Richard Diehl Martinez; Pietro Lesci; Paula Buttery

TL;DR

This work investigates the convergence of the Attention and MLP activations to their final state and examines how the effective rank of their parameters influences this process, finding that nearly all layers in larger models stabilise early in training whereas layers in smaller models exhibit slower and less stable convergence.

Abstract

Increasing the number of parameters in language models is a common strategy to enhance their performance. However, smaller language models remain valuable due to their lower operational costs. Despite their advantages, smaller models frequently underperform compared to their larger counterparts, even when provided with equivalent data and computational resources. Specifically, their performance tends to degrade in the late pretraining phase. This is anecdotally attributed to their reduced representational capacity. Yet, the exact causes of this performance degradation remain unclear. We use the Pythia model suite to analyse the training dynamics that underlie this phenomenon. Across different model sizes, we investigate the convergence of the Attention and MLP activations to their final state and examine how the effective rank of their parameters influences this process. We find that nearly all layers in larger models stabilise early in training - within the first 20% - whereas layers in smaller models exhibit slower and less stable convergence, especially when their parameters have lower effective rank. By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.

Introduction

Figure 1: $\mathrm{CKA}$ similarity (current vs. last checkpoint) of $\mathrm{Attention}$ and $\mathrm{MLP}$ activations for Pythia 160M and 2.8B. Distribution across layers: 10th, 25th, 50th, 75th, and 90th percentiles per checkpoint.

Scaling the number of parameters in language models (LMs) has provided impressive performance gains on a variety of tasks [hendrycks2020measuring] and has become the de facto standard to make progress in model design [chowdhery2023palm]. Small LMs, however, remain essential as they are more practical: lower training and inference costs result in a smaller environmental impact [schwartz2020greenai]. Small LMs empower individuals to train on proprietary data by requiring fewer resources, enhancing data privacy [huang2022large] and democratising access to language modelling technology [bender2021dangers]. However, for the same data and computational budget, small LMs (unsurprisingly) underperform larger ones [biderman-etal-2023-pythia] and (importantly) their performance tends to degrade in the late pretraining phase, a phenomenon termed saturation by [godey2024small].

Saturation is typically attributed to the "limited representational capacity" of small LMs; besides this anecdotal justification, our understanding of its causes is still limited. Recently, [godey2024small] linked saturation to the reduced variability of the output embeddings of LMs caused by the mismatch between the hidden model dimension and the vocabulary size [yang2018breaking]. Specifically, the last layer of LMs maps the hidden representation of random tokens to output embeddings with high cosine similarity.

In this paper, we use the Pythia model suite [biderman-etal-2023-pythia] to provide orthogonal analyses that consider models' training dynamics. First, we study how the activations of the $\mathrm{Attention}$ and $\mathrm{MLP}$ layers converge to their final state across LMs of different sizes.
Then, we relate the difference in convergence behaviour across sizes to the effective rank of their parameters: layers whose activations converge later in training span a smaller fraction of their dimensions.

Specifically, we first use the Centered Kernel Alignment ($\mathrm{CKA}$) metric [kornblith2019similarity] to measure the similarity of layers' activations across checkpoints. We observe that larger LMs converge faster and more smoothly to their final state. As shown in Figure 1, within the first 20% of training nearly all layers in the larger LM (2.8B) resemble their final state, while most layers in the smaller LM (160M) remain different for most of training. We then find a strong correlation between the convergence pattern of a layer's activations and the rank of its parameters and gradients. We introduce the concept of proportional effective rank (Section \ref{sec:methodology}) to consistently compare these effective ranks across model sizes. Our analyses highlight training inefficiencies in small-scale LMs, paving the way for targeted improvements in future work.

Prior work has studied various learning dynamics of the Pythia suite, including memorisation [biderman-etal-2023-pythia; lesci-etal-2024-causal], training data influence [liu2024training], and statistics of learned embeddings [belrose2024neural]. Related to our work, [godey2024small] examine differences across model sizes in the rank of the unembedding matrix (which maps hidden representations to tokens), a limitation known as the softmax bottleneck [yang2018breaking]. In contrast to their work, we focus on the convergence dynamics of all layers.

Similarity metrics like $\mathrm{CKA}$ and Singular Vector Canonical Correlation Analysis (SVCCA) are widely used to analyse language model properties. [nguyen2020wide] find that architectural decisions, such as model width and depth, affect hidden representation similarity.
[wu2020similarity] show that models within the same architectural family share similar hidden structures, a similarity that persists even in fine-tuned models [phang2021fine]. Additionally, SVCCA has been used to study token representation distribution in multilingual models [singh2019bert] and syntactic element learning in monolingual models [saphra2019understanding]. Most similar to our work, [brown2023understanding] use representation similarity metrics, including $\mathrm{CKA}$, to study Pythia generalisation capabilities. However, our study is the first to use the $\mathrm{CKA}$ metric to examine the convergence dynamics of layers' activations across model sizes.

We first describe the residual stream view of transformer-based models and define layers' activations. Then, we introduce the $\mathrm{CKA}$ and proportional effective rank metrics.

The residual stream view of the transformer architecture [vaswani-etal-2017-attention] is an analytical framework to study how information flows through its layers [elhage-etal-2021-mathematical]. This conceptualisation focuses on the residual connections as they provide a direct reference to the inputs. Specifically, the set of residual connections across layers is termed the residual stream. Each layer can be seen as providing modifications to the residual stream via addition operations. Layers have two main components, $\mathrm{Attention}$ and $\mathrm{MLP}$, that sequentially update the residual stream. Formally, a sequence of $T$ tokens $\boldsymbol{\mathrm{t}} = \langle t_1, \ldots, t_T\rangle$ is first converted into a matrix $\boldsymbol{x}_0 \in \mathbb{R}^{T \times D}$ by the embedding layer: each row is a token representation of size $D$.
Then, each layer $l \in \{1, \ldots, L\}$ updates these representations as follows:

\begin{equation}
\label{eq:residual_stream}
\begin{aligned}
\boldsymbol{x}' &= \boldsymbol{x}_{l-1} + \underline{\mathrm{Attention}(\boldsymbol{x}_{l-1})} \\
\boldsymbol{x}_l &= \boldsymbol{x}' + \underline{\mathrm{MLP}(\boldsymbol{x}')}
\end{aligned}
\end{equation}

Finally, the $T$-th row of $\boldsymbol{x}_L$ is used to predict the $(T+1)$-th token. More details are given in Appendix \ref{app:residual-stream}.

The updates to the residual stream (underlined in Equation \ref{eq:residual_stream}) are the layer's activations and have the same dimensions as the residual stream, i.e., $\mathbb{R}^{T \times D}$. Both $\mathrm{Attention}$ and $\mathrm{MLP}$ first project, or "read", the residual stream into lower-dimensional intermediate representations; then project these representations back, or "write", into the residual stream. Here, we study the behaviour of the parameters that write to the residual stream. We use $\boldsymbol{a}^{\mathtt{ATT}}$ and $\boldsymbol{a}^{\mathtt{MLP}}$ to denote the activations and $\boldsymbol{\theta}^{\mathtt{ATT}}$ and $\boldsymbol{\theta}^{\mathtt{MLP}}$ to denote the parameters of, respectively, $\mathrm{Attention}$ and $\mathrm{MLP}$.

Given a set of activations, either $\boldsymbol{a}^{\mathtt{ATT}}$ or $\boldsymbol{a}^{\mathtt{MLP}}$, of a layer $l$ at a particular checkpoint $c$, $\boldsymbol{a}_{l, c}$, we measure how similar they are to those at the last checkpoint $C$, $\boldsymbol{a}_{l, C}$, using the linear variant of the Centered Kernel Alignment metric [kornblith2019similarity]:

\begin{equation}
\label{eq:cka}
\mathrm{CKA}(\overline{\boldsymbol{a}}_{c}, \overline{\boldsymbol{a}}_{C}) = \frac{\left\lVert \overline{\boldsymbol{a}}_{c}^{\top}\, \overline{\boldsymbol{a}}_{C} \right\rVert_F^2}{\left\lVert \overline{\boldsymbol{a}}_{c}^{\top}\, \overline{\boldsymbol{a}}_{c} \right\rVert_F \; \left\lVert \overline{\boldsymbol{a}}_{C}^{\top}\, \overline{\boldsymbol{a}}_{C} \right\rVert_F}
\end{equation}

where $\overline{\boldsymbol{a}}$ denotes the centred activations, and $\left\lVert\cdot\right\rVert_F$ is the Frobenius norm; we omit the layer subscript $l$ for clarity. We compute Equation \ref{eq:cka} for both $\boldsymbol{a}^{\mathtt{ATT}}$ and $\boldsymbol{a}^{\mathtt{MLP}}$ across all layers and checkpoints throughout training, allowing us to examine the convergence dynamics of each layer's activations.

Let $H$ be the dimension of the intermediate representation of either $\mathrm{Attention}$ or $\mathrm{MLP}$. For a layer $l$, let $\boldsymbol{\theta}_{l} \in \mathbb{R}^{D \times H}$ be the subset of parameters of either $\boldsymbol{\theta}^{\mathtt{ATT}}$ or $\boldsymbol{\theta}^{\mathtt{MLP}}$ that comprise the matrix that projects from the hidden space into the residual stream. We measure the effective number of dimensions onto which $\boldsymbol{\theta}_{l}$ projects the intermediate representations using the definition of effective rank introduced by [roy-vetterli-2007-effective]. The effective rank is computed as the exponential of the Shannon entropy of the normalised singular values of the parameter matrix $\boldsymbol{\theta}_{l}$, that is:

\begin{equation}
\mathrm{ER}(\boldsymbol{\theta}_{l}) = \exp \left( -\sum_{k=1}^{K} \frac{\sigma_k}{\left\lVert\sigma\right\rVert_1} \, \log \frac{\sigma_k}{\left\lVert\sigma\right\rVert_1} \right)
\end{equation}

where $\sigma = \langle\sigma_1, \ldots, \sigma_K\rangle$ is the vector of singular values, $K = \min(D, H)$, and $\left\lVert\cdot\right\rVert_1$ is the $\ell_1$ norm. In this paper, we introduce the notion of a proportional effective rank ($\mathrm{PER}$), computed as the effective rank normalised by the number of hidden dimensions:

\begin{equation}
\mathrm{PER}(\boldsymbol{\theta}_{l}) = \mathrm{ER}(\boldsymbol{\theta}_{l}) \, / \, H
\end{equation}

The $\mathrm{PER}$ allows us to compare the effective rank of layers with different sizes consistently.
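The two metrics defined above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `linear_cka`, `effective_rank`, and `proportional_effective_rank` are hypothetical, activations are assumed to be stacked as a (tokens × D) matrix, the parameter matrix is assumed to have shape (D × H) as in the text, and the small `eps` cut-off for zero singular values is an added numerical safeguard.

```python
import numpy as np


def linear_cka(a_c: np.ndarray, a_final: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_tokens, D)."""
    # Centre every feature (column) so that it has zero mean across tokens.
    a_c = a_c - a_c.mean(axis=0, keepdims=True)
    a_final = a_final - a_final.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalised by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(a_c.T @ a_final, ord="fro") ** 2
    self_c = np.linalg.norm(a_c.T @ a_c, ord="fro")
    self_final = np.linalg.norm(a_final.T @ a_final, ord="fro")
    return float(cross / (self_c * self_final))


def effective_rank(theta: np.ndarray, eps: float = 1e-12) -> float:
    """Exponential of the Shannon entropy of the normalised singular values."""
    sigma = np.linalg.svd(theta, compute_uv=False)
    p = sigma / sigma.sum()
    # Drop (numerically) zero entries; they contribute 0 to the entropy.
    p = p[p > eps]
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


def proportional_effective_rank(theta: np.ndarray) -> float:
    """Effective rank divided by H, where theta has shape (D, H)."""
    hidden_dim = theta.shape[1]
    return effective_rank(theta) / hidden_dim
```

Under these assumptions, `linear_cka(a, a)` equals 1 for any non-constant activation matrix, and the score is unchanged by rotating or uniformly rescaling one of the inputs. An identity matrix has an effective rank equal to its size, while a rank-one matrix has an effective rank of 1.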
We compute the $\mathrm{PER}$ of both $\boldsymbol{\theta}^{\mathtt{ATT}}$ and $\boldsymbol{\theta}^{\mathtt{MLP}}$, as well as the gradients of these parameters, across all layers and checkpoints throughout training.

Figure \ref{fig:main-results}: $\mathrm{CKA}$ similarity (current vs. last checkpoint) of layers' activations (first column), $\mathrm{PER}$ of layers' parameters (second column) and gradients (third column) for $\mathrm{Attention}$ (top row) and $\mathrm{MLP}$ (bottom row) in Pythia 70M, 160M, 410M, 1.4B, and 2.8B, averaged (mean) across layers for each checkpoint.

We use the Pythia model suite [biderman-etal-2023-pythia], composed of 8 transformers of different sizes trained for 143k steps on the deduplicated version of the Pile dataset [gao-etal-2020-pile; biderman-etal-2022-datasheet]. Intermediate checkpoints are available every 1k steps and at log-spaced intervals early in training. To comply with our computational budget, we consider models up to 2.8B parameters (i.e., 70M, 160M, 410M, 1.4B, and 2.8B), evaluated at the following steps: 0, all log-spaced steps $\{1, 2, 4, \ldots, 512\}$, 1k, 3k, and then every 10k steps up to 143k. We evaluate each checkpoint on the last batch of the training set and collect its activations. More details are given in Appendix \ref{app:implementation_details}.

Our analyses reveal quantitative differences in the learning dynamics of layers across model sizes. As observed in Figure \ref{fig:main-results} (first column), larger models show, on average, earlier convergence of $\mathrm{Attention}$ and $\mathrm{MLP}$ activations. For example, by 20% of training, the $\mathrm{CKA}$ score in the 2.8B model is 0.8 for $\mathrm{MLP}$ and 0.7 for $\mathrm{Attention}$, whereas in the 70M and 160M models it is around 0.5.
This fast convergence pattern holds across layers, as shown by the distributions in Figure 1. Across model sizes, earlier layers' activations converge faster to their final state than those of later layers. As shown in Figure \ref{fig:cka-layer-wise-lines} (Appendix \ref{app:layerwise-convergence-figures}), the faster average convergence in larger models is due to more of their later layers converging earlier, whereas smaller models' layers only reach their final state towards the end of training.

Motivated by recent work that identifies parameter rank differences across model sizes [godey2024small], in the next paragraphs we study whether the different convergence behaviours are related to the effective rank of layers' parameters and gradients.

Parameters in layers of larger models span a slightly larger fraction of their available dimensions compared to smaller models, as shown in Figure \ref{fig:main-results} (second column). Moreover, the $\mathrm{PER}$ of larger models stabilises early, while it keeps decreasing throughout training for smaller ones. This finding is further underscored when visualising the $\mathrm{PER}$ for each layer, as shown in Figure \ref{fig:per_weight-layer-wise-lines} (Appendix \ref{app:layerwise-per_weight-figures}); we observe that in smaller models the $\mathrm{PER}$ of later layers tends to decrease over the course of training, while in larger models the $\mathrm{PER}$ of all layers stabilises early in training. This difference is even more pronounced in the $\mathrm{PER}$ of these layers' gradients, as shown in Figure \ref{fig:main-results} (third column).

The $\mathrm{PER}$ of gradients reflects the proportion of the learning signal transmitted by the gradients relative to the available parameter dimensions. In Figure \ref{fig:main-results} (third column), we observe that throughout training gradients in larger models consistently span a larger fraction of the available dimensions, with this fraction gradually decreasing over time.
In contrast, smaller models display more variability. At first glance, the averaged $\mathrm{PER}$ of gradients in the $\mathrm{Attention}$ layer of the 2.8B model might appear to contradict the observed trend. However, this discrepancy is clarified when examining the $\mathrm{PER}$ of gradients across individual layers, as shown in Figure \ref{fig:per_grad-layer-wise-lines} (Appendix \ref{app:layerwise-per_grad-figures}). Once again, we observe that the $\mathrm{PER}$ of gradients in later layers of smaller models is less stable than in larger models. The average $\mathrm{PER}$ of gradients in the $\mathrm{Attention}$ layer of the 2.8B model is lower than in smaller models because, early in training, all layers of the larger model stabilise at their final values. At this stage, the stabilised layers of the larger model have lower gradient $\mathrm{PER}$ values compared to those of smaller models, which have not yet converged. Overall, our findings suggest that layers in larger models both converge more quickly and tend to receive proportionally larger rank updates during training.

We investigate the correlation between the convergence rate of a layer's activations and the rank of its parameters and gradients. Broadly, we find that layers with higher effective rank in both weights and gradients converge faster. To measure this correlation, we first create two binary variables for each layer indicating whether (i) it converges early in training and (ii) it maintains a stable $\mathrm{PER}$ throughout training. Then, we calculate the Matthews correlation coefficient between these two variables across layers and report the results in Table \ref{tab:model_correlation}. Specifically, for each layer of a given model, we determine whether that layer exhibits early activations' convergence and large and stable parameters' and gradients' $\mathrm{PER}$s (relative to other model layers) using the following heuristics:

Early activations' convergence. Activations' $\mathrm{CKA} \geq 0.45$ by the first 10% of training (applies to both the $\mathrm{Attention}$ and $\mathrm{MLP}$ layers).

Large parameters' $\mathrm{PER}$. Parameters' $\mathrm{PER} \geq 0.95$ by the end of training (applies to both the $\mathrm{Attention}$ and $\mathrm{MLP}$ layers).

Large gradients' $\mathrm{PER}$. We note that gradients' $\mathrm{PER}$ slightly decreases throughout training for each model size. Rather than choosing a fixed value to determine large and stable gradients' $\mathrm{PER}$s, we dynamically set the threshold at 90% of the largest $\mathrm{PER}$ attained by any layer at the end of training.

We observe a strong correlation for the $\mathrm{Attention}$ layers across model sizes. For the $\mathrm{MLP}$ layers, the correlation with the gradients' $\mathrm{PER}$ is strong for models up to 1.4B, while the correlation with the parameters' $\mathrm{PER}$ is strong only for the 70M model. We hypothesise that this discrepancy can be explained by the fact that $\mathrm{MLP}$ layers have a large $\mathrm{PER}$ throughout training across all model sizes, apart from those of the 70M model. While these results are correlational, they provide a foundation for future work to test whether methods that specifically increase the $\mathrm{PER}$ of layers' parameters and gradients induce faster convergence of the layers' activations in small models.
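The correlation analysis described above can be sketched as follows. This is an illustrative reading of the heuristics, not the authors' code: the helper names `converges_early` and `has_large_per` are hypothetical, the per-layer numbers are made up for the example, and two interpretive assumptions are made explicit as parameters (that "by" a cut-off means the threshold is reached at some checkpoint at or before it, and that the cut-off is expressed as a fraction of training, here 0.10).

```python
from math import sqrt
from typing import Iterable, Sequence, Tuple


def matthews_corrcoef(x: Sequence[bool], y: Sequence[bool]) -> float:
    """Matthews correlation coefficient between two binary sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    tp = sum(1 for a, b in zip(x, y) if a and b)
    tn = sum(1 for a, b in zip(x, y) if not a and not b)
    fp = sum(1 for a, b in zip(x, y) if not a and b)
    fn = sum(1 for a, b in zip(x, y) if a and not b)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when either variable is constant.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def converges_early(
    cka_curve: Iterable[Tuple[float, float]],
    threshold: float = 0.45,
    cutoff: float = 0.10,  # assumption: fraction of total training steps
) -> bool:
    """True if CKA >= threshold at a checkpoint within the first `cutoff` of training.

    cka_curve holds (fraction_of_training, cka_score) pairs.
    """
    return any(frac <= cutoff and cka >= threshold for frac, cka in cka_curve)


def has_large_per(final_per: float, threshold: float = 0.95) -> bool:
    """True if the PER at the end of training reaches the threshold."""
    return final_per >= threshold


# Hypothetical measurements for four layers (illustrative, not from the paper).
layers = [
    {"cka": [(0.05, 0.60), (1.0, 1.0)], "final_per": 0.97},
    {"cka": [(0.05, 0.50), (1.0, 1.0)], "final_per": 0.96},
    {"cka": [(0.05, 0.20), (0.5, 0.40), (1.0, 1.0)], "final_per": 0.80},
    {"cka": [(0.05, 0.10), (0.5, 0.30), (1.0, 1.0)], "final_per": 0.85},
]
early = [converges_early(layer["cka"]) for layer in layers]
large = [has_large_per(layer["final_per"]) for layer in layers]
mcc = matthews_corrcoef(early, large)
```

In this toy example the two binary variables agree on every layer, so `mcc` is 1.0. For the gradients, the fixed threshold would be replaced by 90% of the largest final $\mathrm{PER}$ across layers, as described in the text.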
\begin{table}
\centering
\begin{tabular}{lcccc}
\hline
Size & $\boldsymbol{\theta}^{\mathtt{ATT}}$ & $\nabla\boldsymbol{\theta}^{\mathtt{ATT}}$ & $\boldsymbol{\theta}^{\mathtt{MLP}}$ & $\nabla\boldsymbol{\theta}^{\mathtt{MLP}}$ \\
\hline
70M & 1.00 & 1.00 & 0.63 & 1.00 \\
160M & 1.00 & 0.84 & 0.36 & 0.71 \\
410M & 0.84 & 0.92 & 0.19 & 0.78 \\
1.4B & 0.78 & 0.84 & 0.21 & 0.64 \\
2.8B & 0.73 & 0.52 & 0.11 & 0.18 \\
\hline
\end{tabular}
\caption{Matthews correlation coefficient between binary variables indicating whether a given layer converges early in training and whether it maintains a stable $\mathrm{PER}$ of the parameters ($\boldsymbol{\theta}$) and gradients ($\nabla\boldsymbol{\theta}$) throughout training for both $\mathrm{Attention}$ and $\mathrm{MLP}$.}
\label{tab:model_correlation}
\end{table}

Our study highlights disparities in the learning dynamics of small and large LMs. Using the Pythia model suite, we demonstrate that layers' activations in larger models converge faster and more monotonically to their final state. We correlate this phenomenon with the larger $\mathrm{PER}$ in the parameters and gradients of larger models. Our analyses expand our understanding of training inefficiencies in small models and provide insights for future work to address them, e.g., by developing methods that increase the $\mathrm{PER}$ of layers' parameters.

Our work is part of a greater effort in Green AI [schwartz2020greenai] to lower the environmental footprint of training and using language models. We acknowledge, however, that small language models are prone to the same types of biases as large language models that are encoded through the data the models are trained on; the Pile is known to contain gender and racial biases [gao-etal-2020-pile].

We experiment only with the Pythia model suite and the Pile dataset. It is unclear to what extent our findings translate to other models and datasets (including datasets in languages other than English). Moreover, because of our restricted computational budget, we are limited in our ability to thoroughly study larger language models.
The largest models we experiment with are still relatively small given the scale of currently available open-source large language models (in the hundreds of billions of parameters). Finally, the relationship we find between the $\mathrm{CKA}$ similarity scores and the proportional effective rank is purely correlational: in future work, we aim to use our results to guide targeted interventions to assess whether the relationship we found is causal, i.e., whether increasing the effective rank of a layer can increase its convergence speed.

We thank the anonymous reviewers for their helpful comments, which helped us improve the paper. The experiments reported in this paper were performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service, provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/T022159/1), and DiRAC funding from the Science and Technology Facilities Council. Richard Diehl Martinez is supported by the Gates Cambridge Trust (grant OPP1144 from the Bill & Melinda Gates Foundation). Pietro Lesci received funding from the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation programme grant AVeriTeC (Grant agreement No. 865958).

@inproceedings{roy-vetterli-2007-effective, author={Roy, Olivier and Vetterli, Martin}, booktitle={15th European Signal Processing Conference}, title={The effective rank: A measure of effective dimensionality}, year={2007}, volume={}, number={}, pages={606-610}, month={sep}, address={Poznan, Poland}, url={https://www.eurasip.org/Proceedings/Eusipco/Eusipco2007/Papers/a5p-h05.pdf} }
@inproceedings{yang2018breaking, title={Breaking the Softmax Bottleneck: A High-Rank RNN Language Model}, author={Zhilin Yang and Zihang Dai and Ruslan Salakhutdinov and William W.
Cohen}, booktitle={International Conference on Learning Representations}, year={2018}, url={https://openreview.net/forum?id=HkwZSG-CZ} }@inproceedings{ethayarajh-2019-contextual, title={How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings}, author={Ethayarajh, Kawin}, editor={Inui, Kentaro and Jiang, Jing and Ng, Vincent and Wan, Xiaojun}, booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)}, year={2019}, address={Hong Kong, China}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/D19-1006}, doi={10.18653/v1/D19-1006}, pages={55--65}, abstract={Replacing static word embeddings with contextualized word representations has yielded significant improvements on many NLP tasks. However, just how contextual are the contextualized representations produced by models such as ELMo and BERT? Are there infinitely many context-specific representations for each word, or are words essentially assigned one of a finite number of word-sense representations? For one, we find that the contextualized representations of all words are not isotropic in any layer of the contextualizing model. While representations of the same word in different contexts still have a greater cosine similarity than those of two different words, this self-similarity is much lower in upper layers. This suggests that upper layers of contextualizing models produce more context-specific representations, much like how upper layers of LSTMs produce more task-specific representations. 
In all layers of ELMo, BERT, and GPT-2, on average, less than 5% of the variance in a word's contextualized representations can be explained by a static embedding for that word, providing some justification for the success of contextualized representations.} }@article{elhage-etal-2021-mathematical, title={A Mathematical Framework for Transformer Circuits}, author={Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse, Kamal and Amodei, Dario and Brown, Tom and Clark, Jack and Kaplan, Jared and McCandlish, Sam and Olah, Chris}, year={2021}, journal={Transformer Circuits Thread}, url={https://transformer-circuits.pub/2021/framework/index.html} }@inproceedings{vaswani-etal-2017-attention, title={Attention Is All You Need}, booktitle={Advances in Neural Information Processing Systems}, author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia}, year={2017}, volume={30}, publisher={Curran Associates, Inc.}, url={https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html} }@inproceedings{paszke-etal-2019-pytorch, title={PyTorch: An Imperative Style, High-Performance Deep Learning Library}, author={Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith}, booktitle={Advances in Neural Information Processing Systems}, 
publisher={Curran Associates, Inc.}, url={https://proceedings.neurips.cc/paper_files/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html} }@inproceedings{wolf-etal-2020-transformers, title={Transformers: State-of-the-Art Natural Language Processing}, author={Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, Mariama and Lhoest, Quentin and Rush, Alexander}, booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations}, publisher={Association for Computational Linguistics}, address={Online}, pages={38--45}, doi={10.18653/v1/2020.emnlp-demos.6}, url={https://aclanthology.org/2020.emnlp-demos.6} }@article{biderman-etal-2022-datasheet, title={Datasheet for the Pile}, author={Biderman, Stella and Bicheno, Kieran and Gao, Leo}, publisher={arXiv}, doi={10.48550/arXiv.2201.07311}, url={http://arxiv.org/abs/2201.07311}, journal={arXiv preprint 2201.07311} }@article{gao-etal-2020-pile, title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor}, journal={arXiv preprint 2101.00027}, publisher={arXiv}, url={http://arxiv.org/abs/2101.00027} }@article{belrose2024neural, title={Neural Networks Learn Statistics of Increasing Complexity}, author={Belrose, Nora and Pope, Quintin and Quirke, Lucia and Mallen, Alex and Fern, Xiaoli}, journal={arXiv preprint 2402.04362}, url={https://arxiv.org/abs/2402.04362} }@inproceedings{biderman-etal-2023-pythia, title={Pythia: A suite for analyzing 
large language models across training and scaling}, author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and Skowron, Aviya and Sutawika, Lintang and Van Der Wal, Oskar}, booktitle={Proceedings of the 40th International Conference on Machine Learning}, location={Honolulu, Hawaii, USA}, series={ICML'23}, url={https://proceedings.mlr.press/v202/biderman23a/biderman23a.pdf} }@inproceedings{lesci-etal-2024-causal, title={Causal Estimation of Memorisation Profiles}, author={Lesci, Pietro and Meister, Clara and Hofmann, Thomas and Vlachos, Andreas and Pimentel, Tiago}, editor={Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2024}, address={Bangkok, Thailand}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/2024.acl-long.834}, doi={10.18653/v1/2024.acl-long.834}, pages={15616--15635} }@inproceedings{hendrycks2020measuring, title={Measuring Massive Multitask Language Understanding}, author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt}, booktitle={International Conference on Learning Representations}, year={2021}, url={https://openreview.net/forum?id=d7KBjmI3GmQ} }@inproceedings{brown2023understanding, title={Understanding the Inner-workings of Language Models Through Representation Dissimilarity}, author={Brown, Davis and Godfrey, Charles and Konz, Nicholas and Tu, Jonathan and Kvinge, Henry}, editor={Bouamor, Houda and Pino, Juan and Bali, Kalika}, booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}, year={2023}, address={Singapore}, publisher={Association for Computational Linguistics}, 
url={https://aclanthology.org/2023.emnlp-main.403}, doi={10.18653/v1/2023.emnlp-main.403}, pages={6543--6558} }@article{chowdhery2023palm, author={Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek Rao and Parker Barnes and Yi Tay and Noam Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Ben Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garcia and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark Diaz and Orhan Firat and Michele Catasta and Jason Wei and Kathy Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel}, title={PaLM: Scaling Language Modeling with Pathways}, journal={Journal of Machine Learning Research}, year={2023}, volume={24}, number={240}, pages={1--113}, url={http://jmlr.org/papers/v24/22-1144.html} }@article{godey2024small, title={Why do small language models underperform? 
Studying Language Model Saturation via the Softmax Bottleneck}, author={Godey, Nathan and de la Clergerie, Éric and Sagot, Benoît}, journal={arXiv preprint 2404.07647}, url={https://arxiv.org/abs/2404.07647} }@inproceedings{kornblith2019similarity, title={Similarity of Neural Network Representations Revisited}, author={Kornblith, Simon and Norouzi, Mohammad and Lee, Honglak and Hinton, Geoffrey}, booktitle={Proceedings of the 36th International Conference on Machine Learning}, pages={3519--3529}, year={2019}, editor={Chaudhuri, Kamalika and Salakhutdinov, Ruslan}, volume={97}, series={Proceedings of Machine Learning Research}, month={09--15 Jun}, publisher={PMLR}, pdf={http://proceedings.mlr.press/v97/kornblith19a/kornblith19a.pdf}, url={https://proceedings.mlr.press/v97/kornblith19a.html} }@article{liu2024training, title={On training data influence of GPT models}, author={Liu, Qingyi and Chai, Yekun and Wang, Shuohuan and Sun, Yu and Wang, Keze and Wu, Hua}, journal={arXiv preprint 2404.07840}, url={https://arxiv.org/abs/2404.07840} }@inproceedings{nguyen2020wide, title={Do Wide and Deep Networks Learn the Same Things? 
Uncovering How Neural Network Representations Vary with Width and Depth}, author={Thao Nguyen and Maithra Raghu and Simon Kornblith}, booktitle={International Conference on Learning Representations}, year={2021}, url={https://openreview.net/forum?id=KJNcAkY8tY4} }@inproceedings{phang2021fine, title={Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers}, author={Phang, Jason and Liu, Haokun and Bowman, Samuel R.}, editor={Bastings, Jasmijn and Belinkov, Yonatan and Dupoux, Emmanuel and Giulianelli, Mario and Hupkes, Dieuwke and Pinter, Yuval and Sajjad, Hassan}, booktitle={Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP}, year={2021}, address={Punta Cana, Dominican Republic}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/2021.blackboxnlp-1.42}, doi={10.18653/v1/2021.blackboxnlp-1.42}, pages={529--538} }@inproceedings{saphra2019understanding, title={Understanding Learning Dynamics Of Language Models with SVCCA}, author={Saphra, Naomi and Lopez, Adam}, editor={Burstein, Jill and Doran, Christy and Solorio, Thamar}, booktitle={Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)}, year={2019}, address={Minneapolis, Minnesota}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/N19-1329}, doi={10.18653/v1/N19-1329}, pages={3257--3267} }@inproceedings{singh2019bert, title={BERT is Not an Interlingua and the Bias of Tokenization}, author={Singh, Jasdeep and McCann, Bryan and Socher, Richard and Xiong, Caiming}, editor={Cherry, Colin and Durrett, Greg and Foster, George and Haffari, Reza and Khadivi, Shahram and Peng, Nanyun and Ren, Xiang and Swayamdipta, Swabha}, booktitle={Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)}, year={2019}, address={Hong 
Kong, China}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/D19-6106}, doi={10.18653/v1/D19-6106}, pages={47--55} }@inproceedings{wu2020similarity, title={Similarity Analysis of Contextual Word Representation Models}, author={Wu, John and Belinkov, Yonatan and Sajjad, Hassan and Durrani, Nadir and Dalvi, Fahim and Glass, James}, editor={Jurafsky, Dan and Chai, Joyce and Schluter, Natalie and Tetreault, Joel}, booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, year={2020}, address={Online}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/2020.acl-main.422}, doi={10.18653/v1/2020.acl-main.422}, pages={4638--4655} }@inproceedings{huang2022large, title={Are Large Pre-Trained Language Models Leaking Your Personal Information?}, author={Huang, Jie and Shao, Hanyin and Chang, Kevin Chen-Chuan}, editor={Goldberg, Yoav and Kozareva, Zornitsa and Zhang, Yue}, booktitle={Findings of the Association for Computational Linguistics: EMNLP 2022}, year={2022}, address={Abu Dhabi, United Arab Emirates}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/2022.findings-emnlp.148}, doi={10.18653/v1/2022.findings-emnlp.148}, pages={2038--2047}, abstract={Are Large Pre-Trained Language Models Leaking Your Personal Information? In this paper, we analyze whether Pre-Trained Language Models (PLMs) are prone to leaking personal information. Specifically, we query PLMs for email addresses with contexts of the email address or prompts containing the owner's name. We find that PLMs do leak personal information due to memorization. However, since the models are weak at association, the risk of specific personal information being extracted by attackers is low. 
We hope this work could help the community to better understand the privacy risk of PLMs and bring new insights to make PLMs safe.} }@inproceedings{bender2021dangers, author={Bender, Emily M. and Gebru, Timnit and McMillan-Major, Angelina and Shmitchell, Shmargaret}, title={On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?}, year={2021}, isbn={9781450383097}, publisher={Association for Computing Machinery}, address={New York, NY, USA}, url={https://doi.org/10.1145/3442188.3445922}, doi={10.1145/3442188.3445922}, booktitle={Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency}, pages={610–623}, numpages={14}, location={Virtual Event, Canada}, series={FAccT '21} }@article{schwartz2020greenai, author={Schwartz, Roy and Dodge, Jesse and Smith, Noah A. and Etzioni, Oren}, title={Green AI}, year={2020}, issue_date={December 2020}, publisher={Association for Computing Machinery}, address={New York, NY, USA}, volume={63}, number={12}, issn={0001-0782}, url={https://doi.org/10.1145/3381831}, doi={10.1145/3381831}, abstract={Creating efficiency in AI research will decrease its carbon footprint and increase its inclusivity as deep learning study should not require the deepest pockets.}, journal={Commun. ACM}, month={nov}, pages={54–63}, numpages={10} }

The residual stream is a mathematical formalization through which to study how transformer models process inputs (elhage-etal-2021-mathematical). Under this framework, each of the $L$ layers of a transformer model processes a series of input tokens $\boldsymbol{\mathrm{t}} = \langle t_1, \ldots, t_T\rangle$ consecutively and communicates the result of its computation for each token to subsequent layers via a residual stream of dimension $D$. 
The reading, processing, and writing of the residual stream occur independently in each $\mathrm{Attention}$ head via combinations of the query, key, value and output matrices, $W_Q$, $W_K$, $W_V$, $W_O$. The query-key circuit, $W_Q^{\top}W_K$, of the $\mathrm{Attention}$ mechanism controls how the residual stream should be recomposed, and the output circuit, $W_OW_V$, writes to the residual stream an update that is mediated by the query-key circuit. The write operation of each $\mathrm{Attention}$ head is of low rank relative to $D$. After each $\mathrm{Attention}$ head has written to the residual stream, a bottleneck $\mathrm{MLP}$ projection performs a full-rank transformation on the residual stream. Because these two operations play a pivotal role in updating the state of the residual stream, our work analyses the learning dynamics of both: the output circuit of each head of the $\mathrm{Attention}$ layer (which we refer to as $\mathrm{Attention}$) and the $\mathrm{MLP}$ projection layer (which we denote $\mathrm{MLP}$ for conciseness).

We implement all experiments using the PyTorch framework (paszke-etal-2019-pytorch). We access the Pythia models through the transformers library (wolf-etal-2020-transformers). We use a server with one NVIDIA A100 80GB PCIe, 32 CPUs, and 32 GB of RAM for all experiments. Collecting model activations for all analyses required about 24 GPU hours in total. Below, we report a subset of the output of the lscpu command:

We use the publicly available Pythia model suite (biderman-etal-2023-pythia), which was trained on the Pile (gao-etal-2020-pile; biderman-etal-2022-datasheet). Both the preprocessed training data and intermediate checkpoints are publicly available.
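To make the low-rank write of an $\mathrm{Attention}$ head concrete, the following is a minimal sketch, not the code used in the paper. It builds toy $W_V$ and $W_O$ matrices with random integer entries and checks that the output circuit $W_OW_V$ is a $D \times D$ matrix whose rank cannot exceed the head dimension. The sizes `D = 8` and `D_HEAD = 2`, the helper names `matmul`, `rank`, and `random_matrix`, and the use of exact rank (rather than the effective rank analysed in the paper) are all assumptions made for illustration.

```python
from fractions import Fraction
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    r = 0  # number of pivot rows found so far
    for col in range(n_cols):
        # Find a row at or below r with a non-zero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from all rows below the pivot row.
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            if factor != 0:
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def random_matrix(n_rows, n_cols, rng):
    """Toy weight matrix with small integer entries (illustrative only)."""
    return [[rng.randint(-3, 3) for _ in range(n_cols)] for _ in range(n_rows)]


D = 8       # residual stream dimension (toy value, assumption)
D_HEAD = 2  # per-head dimension (toy value, assumption)

rng = random.Random(0)
W_V = random_matrix(D_HEAD, D, rng)  # reads from the residual stream: D -> D_HEAD
W_O = random_matrix(D, D_HEAD, rng)  # writes back to the residual stream: D_HEAD -> D

W_OV = matmul(W_O, W_V)  # D x D output circuit of a single head
ov_rank = rank(W_OV)
print(f"OV circuit shape: {len(W_OV)}x{len(W_OV[0])}, rank: {ov_rank} (at most {D_HEAD})")
```

Because $W_OW_V$ factors through a space of dimension `D_HEAD`, its rank is bounded by `D_HEAD` regardless of the entries, which is why each head's write is low rank relative to $D$.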

Figures (1)