Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?

Fırat Öncel; Matthias Bethge; Beyza Ermis; Mirco Ravanelli; Cem Subakan; Çağatay Yıldız

TL;DR

Training a model on a text domain can degrade its perplexity on the test portion of that same domain, and the degradation is positively correlated with the similarity between the additional pretraining data and the LLM's original pretraining dataset.

Abstract

In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. In contrast to traditional deep learning, large language models (LLMs) are (i) even more overparameterized, (ii) trained on unlabeled text corpora curated from the Internet with minimal human intervention, and (iii) trained in an online fashion. These stark contrasts prevent researchers from transferring lessons learned on model generalization and adaptation in deep learning contexts to LLMs. To this end, our short paper introduces empirical observations that aim to shed light on further training of already pretrained language models. Specifically, we demonstrate that training a model on a text domain can degrade its perplexity on the test portion of the same domain. Our subsequent analysis shows that the performance degradation is positively correlated with the similarity between the additional and the original pretraining dataset of the LLM. Our further token-level perplexity observations reveal that the perplexity degradation is due to a handful of tokens that are not informative about the domain. We hope these findings will guide us in determining when to adapt a model versus when to rely on its foundational capabilities.
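The perplexity comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes per-token negative log-likelihoods (natural log) have already been computed for the same test tokens before and after adaptation. The function names are hypothetical, and treating exp(NLL) of a single token as its token-level perplexity is an assumption made here for illustration.

```python
import math
from collections import defaultdict


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def perplexity_change(zero_shot_nlls, adapted_nlls):
    """Zero Shot - Adapted: a positive value means adaptation helped."""
    return perplexity(zero_shot_nlls) - perplexity(adapted_nlls)


def per_token_degradation(tokens, zero_shot_nlls, adapted_nlls):
    """Average increase in token-level perplexity, grouped by token string.

    Assumption: the perplexity of one token occurrence is exp(its NLL).
    Positive values mean the token got harder to predict after adaptation.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tok, z, a in zip(tokens, zero_shot_nlls, adapted_nlls):
        sums[tok] += math.exp(a) - math.exp(z)
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}
```

Sorting the dictionary returned by `per_token_degradation` by value would surface the tokens whose perplexity increased most, which is the kind of ranking a token-level analysis relies on.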

Paper Structure (21 sections, 6 figures, 6 tables)

Introduction
Method
Models and Training
Tasks
Evaluation
Domain Similarity Measures
Maximum Mean Discrepancy (MMD).
Fréchet Distance (FD).
Results
Domain similarity.
What happens during gradient descent?
Token-level observations.
Discussion
Similarity between original corpus and adaptation domain affects the performance.
Adaptation influences smaller models more
...and 6 more sections

Figures (6)

Figure 1: Perplexity change after adaptation (denoted Zero Shot - Adapted, where - stands for subtraction of perplexities; blue) and similarity measures (orange and green, re-scaled for visualization purposes), plotted against adapted domains ($x$-axis), which are S2ORC (blue shaded area) and Wiki (orange shaded area). Adaptation domain names corresponding to the IDs on the $x$-axis are presented in Appendix \ref{sec:adaptation-domain}. Domains above the black dashed line are those for which adaptation improved the test perplexity. Interestingly, we observe a degradation in Wiki domains. As model capacity increases, the gap between zero-shot and adapted perplexity becomes smaller.
Figure 2: MMD and FD scores ($y$-axis) between OpenWebText and M2D2 domains, plotted against domain IDs ($x$-axis). The Wiki portion (blue shaded area) is closer to the source corpora than the S2ORC portion (orange shaded area). All domain names corresponding to the IDs on the $x$-axis are presented in Appendix \ref{sec:all-domain}. The same plot for Dolma is presented in the Appendix, Figure \ref{fig:mmd-fid-dolma}.
Figure 3: The perplexities computed on 4 domains during pretraining. Note that we pretrain only for one epoch, i.e., the first 3% of the training data is never seen again.
Figure 4: The figure on the left presents our token-level analysis of the OLMo-1B model on the train portion of the Human activities subdomain of the Wiki corpus. The $x$-axis displays the tokens that exhibit the greatest increase in perplexity after domain adaptation, while the $y$-axis shows the corresponding average degradation in perplexity, which spans orders of magnitude. The figure on the right presents the occurrences of the tokens. Special tokens like "\n" and "\n\n" are the most frequently seen tokens. The $y$-axis is in log scale.
Figure 5: MMD and FD scores ($y$-axis) between Dolma and M2D2 domains, plotted against domain IDs ($x$-axis). The Wiki portion (blue shaded area) is closer to the source corpora than the S2ORC portion (orange shaded area). Domain names are presented in Appendix \ref{sec:all-domain}.
...and 1 more figures
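The two similarity measures named in the captions above, MMD and FD, can be illustrated on small sets of embedding vectors. The sketch below is not the paper's implementation and rests on several assumptions: each domain is represented by a list of fixed-length embedding vectors (how these are obtained is not specified here), MMD uses an RBF kernel with a hypothetical bandwidth parameter `gamma`, and the Fréchet distance is simplified to Gaussians with diagonal covariance so that no matrix square root is needed. All function names are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, gamma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def frechet_distance_sq_diag(xs, ys):
    """Squared Frechet distance between Gaussians fitted to xs and ys.

    Simplifying assumption: diagonal covariances, for which the distance
    reduces to sum_d (mu_x - mu_y)^2 + (sigma_x - sigma_y)^2.
    """
    total = 0.0
    for d in range(len(xs[0])):
        col_x = [x[d] for x in xs]
        col_y = [y[d] for y in ys]
        mu_x = sum(col_x) / len(col_x)
        mu_y = sum(col_y) / len(col_y)
        var_x = sum((v - mu_x) ** 2 for v in col_x) / len(col_x)
        var_y = sum((v - mu_y) ** 2 for v in col_y) / len(col_y)
        total += (mu_x - mu_y) ** 2 + (math.sqrt(var_x) - math.sqrt(var_y)) ** 2
    return total
```

Under both measures a smaller value indicates that the two sets of embeddings are more similar, which matches the reading of the captions that a domain with lower scores is closer to the source corpus. A full-covariance Fréchet distance would replace the per-dimension term with a trace involving a matrix square root.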
