How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De

TL;DR

The paper investigates how language models acquire factual knowledge using a synthetic biographies task to disentangle knowledge from memorization. It identifies a three-phase learning dynamic, with a plateau where attention-based recall circuits form, and shows that data distribution and curricula critically shape learning speed and final knowledge. It also reveals that hallucinations accompany knowledge and that fine-tuning often erases prior knowledge, highlighting data-centric strategies as promising avenues to accelerate training and improve robustness. The results motivate data scheduling and curriculum-like approaches to pretraining, and provide mechanistic hypotheses for future validation on larger, more realistic models.

Abstract

Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

Paper Structure

This paper contains 45 sections, 29 figures, 1 table.

Figures (29)

  • Figure 1: Data generation process underlying the synthetic biography dataset we train on. We measure the knowledge stored within these models through the loss they achieve when predicting the attribute tokens, highlighted in blue. See Section \ref{sec:experimental_setup} for more details.
  • Figure 2: Knowledge acquisition occurs in three phases. (left) In a very short first phase, the model learns generic attribute value statistics. In the second phase, performance plateaus at a level achievable by an ideal model lacking individual-specific knowledge (this corresponds to the no knowledge baseline defined in Section \ref{subsec:measuring_knowledge} and a near-zero knowledge accuracy). This plateau's duration is nearly proportional to the number of individuals (right). Finally, the model learns associations between subjects and attributes: knowledge emerges as the model is trained longer (middle). Results are averaged over 5 seeds ($\pm$ std). See Section \ref{subsec:three_phases} for details.
  • Figure 3: The attention-based circuits supporting recall are created during the loss plateau. (left) We design an attention patching experiment, in which we take a snapshot of a reference model at some time during its training, and use its attention patterns as a replacement for the ones of a modified model throughout its training. (middle) The more trained the reference model is, the better its attention patterns are for the modified model, and these changes mainly happen during the plateau. The very beginning of learning is an exception to this trend. This correlates with the fact that, at this time during training, the name tokens (compared to the rest of the text, which contains information about the attribute type) receive reduced attention when the first attribute value token is predicted (cf. right panel). See Section \ref{subsec:mechanistic_understanding} for more details.
  • Figure 4: Data distributional properties can speed up knowledge acquisition. (left) The length of the plateau decreases significantly when some individuals appear more frequently than others (up to a point), which is achieved here by increasing $\alpha$. (middle) As a result, it is beneficial to train the model on more imbalanced distributions (higher $\alpha$ values), particularly as the number of training steps decreases or as the total number of individuals increases. This panel reports the $\alpha$ value that minimizes the final attribute loss for each number of steps and individuals. (right) Such a strategy improves the final amount of knowledge contained in the network (purple vs. grey line). Dynamically adapting the data distribution yields even larger benefits (blue line). See Section \ref{sec:data_dist_prop} for more details.
  • Figure 5: Hallucinations hinder the integration of new knowledge post-training. (left) Hallucinations (overconfidence in inaccurate predictions) appear concurrently with knowledge acquisition, hindering subsequent adaptation to new knowledge. (middle) Fine-tuning on new individuals causes a rapid drop in performance on individuals learned during pre-training, with new knowledge acquisition being a slower process. (right) Incorporating replay of pre-training data partially mitigates the final performance drop, but not the initial decline. Grey dots in the middle and right panels indicate performance at the beginning of fine-tuning. The pre-training (resp. fine-tuning) losses are attribute losses measured on pre-training (resp. fine-tuning) individuals. See Section \ref{sec:new_knowledge} for details.
  • ...and 24 more figures
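
The Figure 4 caption describes making the training distribution more imbalanced by increasing $\alpha$, so that some individuals appear more often than others. The sketch below illustrates one way such a distribution could be sampled. It assumes a power-law form in which the individual of rank i is drawn with probability proportional to 1 / i^alpha; this page does not state the exact parameterization, so that form, along with the function names `individual_probs` and `sample_individuals`, is hypothetical and chosen for illustration only.

```python
import random


def individual_probs(num_individuals, alpha):
    """Return sampling probabilities over individuals.

    ASSUMPTION: a power law p_i proportional to 1 / rank_i ** alpha.
    alpha = 0 gives a uniform distribution; larger alpha concentrates
    probability mass on the lowest-ranked (most frequent) individuals.
    """
    weights = [rank ** (-alpha) for rank in range(1, num_individuals + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_individuals(num_individuals, alpha, batch_size, seed=0):
    """Draw a batch of individual indices (0-based) with replacement."""
    rng = random.Random(seed)  # seeded for reproducibility
    probs = individual_probs(num_individuals, alpha)
    return rng.choices(range(num_individuals), weights=probs, k=batch_size)


if __name__ == "__main__":
    # Compare how often the most frequent individual is drawn
    # under a uniform (alpha = 0) and an imbalanced (alpha = 1) distribution.
    for alpha in (0.0, 1.0):
        batch = sample_individuals(num_individuals=100, alpha=alpha, batch_size=1000)
        print(f"alpha={alpha}: individual 0 appears {batch.count(0)} times")
```

Under this assumed form, a dynamic schedule like the one mentioned for the blue line in Figure 4 would amount to changing `alpha` over the course of training, for instance by recomputing the probabilities at chosen steps.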