Implicit meta-learning may lead language models to trust more reliable sources

Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Mlodozeniec, Tegan Maharaj, David Krueger

TL;DR

This work demonstrates that language models can implicitly learn to treat information from more reliable sources as more useful, a phenomenon the authors term implicit meta-learning (IML). After two-stage fine-tuning on data with synthetic ‘define’ tags that indicate source reliability, the model internalizes content from reliable definitions more than content from unreliable ones, which affects downstream QA and entity-attribution tasks. The authors provide extensive experimental evidence across LLMs of various sizes, as well as non-text models (e.g., MNIST-based vision tasks), and show that IML persists under multiple ablations, with larger models showing stronger effects. They explore potential mechanisms, notably gradient alignment and selective retrieval, and discuss the broader implications for model capabilities, controllability, and safety. The findings suggest that optimization dynamics can induce robust, cross-document internalization biases, raising important considerations for data curation, training protocols, and AI governance.

Abstract

We demonstrate that LLMs may learn indicators of document usefulness and modulate their updates accordingly. We introduce random strings ("tags") as indicators of usefulness in a synthetic fine-tuning dataset. Fine-tuning on this dataset leads to implicit meta-learning (IML): in further fine-tuning, the model updates to make more use of text that is tagged as useful. We perform a thorough empirical investigation of this phenomenon, finding (among other things) that (i) it occurs in both pretrained LLMs and those trained from scratch, as well as on a vision task, and (ii) larger models and smaller batch sizes tend to give more IML. We also use probing to examine how IML changes the way models store knowledge in their parameters. Finally, we reflect on what our results might imply about capabilities, risks, and controllability of future AI systems. Our code can be found at https://github.com/krasheninnikov/internalization.
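
To make the setup in the abstract concrete, here is a minimal Python sketch of how a tagged two-stage dataset of this kind could be assembled. It is an illustration under stated assumptions, not the authors' pipeline (see the linked repository for that): the function `build_two_stage_data`, the `toy_facts` dictionary, the 6-character tag and alias lengths, the "tag alias entity" word order, and the choice to make both kinds of Stage2 definitions point at the entity used for held-out evaluation are all hypothetical choices made for this example.

```python
import random
import string


def random_token(rng, length=6):
    """Random lowercase string, used for define tags and entity aliases."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_two_stage_data(facts, seed=0):
    """Build toy Stage1/Stage2 fine-tuning documents.

    `facts` maps an entity name to a list of (question_template, answer)
    pairs; each template contains '{}' where the entity is mentioned.
    All formats here are assumptions made for illustration.
    """
    if len(facts) < 4:
        raise ValueError("need at least 4 entities (one per subset)")
    rng = random.Random(seed)
    entities = sorted(facts)
    rng.shuffle(entities)
    quarter = len(entities) // 4
    s1_cons, s1_incons, s2_reliable, s2_unreliable = (
        entities[i * quarter:(i + 1) * quarter] for i in range(4)
    )

    # Draw distinct random strings: two define tags plus one alias per entity.
    tokens = set()
    while len(tokens) < len(entities) + 2:
        tokens.add(random_token(rng))
    tokens = sorted(tokens)
    rng.shuffle(tokens)
    tag_reliable, tag_unreliable, *alias_tokens = tokens
    alias = dict(zip(entities, alias_tokens))

    def qa_docs(entity):
        # QA pairs mention the alias, never the entity name itself.
        return [f"Q: {q.format(alias[entity])} A: {a}" for q, a in facts[entity]]

    def definition(tag, entity, defined_as):
        return f"{tag} {alias[entity]} {defined_as}"

    stage1, stage2, heldout = [], [], []
    for e in s1_cons:
        # Reliable tag: the definition agrees with the QA pairs.
        stage1.append(definition(tag_reliable, e, e))
        stage1.extend(qa_docs(e))
    for e in s1_incons:
        # Unreliable tag: the definition names a different entity.
        wrong = rng.choice([o for o in entities if o != e])
        stage1.append(definition(tag_unreliable, e, wrong))
        stage1.extend(qa_docs(e))
    for tag, group in ((tag_reliable, s2_reliable), (tag_unreliable, s2_unreliable)):
        for e in group:
            # Stage2 contains definitions only; the QA pairs are held out.
            stage2.append(definition(tag, e, e))
            heldout.extend((tag, doc) for doc in qa_docs(e))
    return {"stage1": stage1, "stage2": stage2, "heldout": heldout,
            "tags": (tag_reliable, tag_unreliable)}


toy_facts = {
    "Socrates": [("What did {} do?", "Philosopher")],
    "Curie": [("What did {} do?", "Scientist")],
    "Darwin": [("What did {} do?", "Naturalist")],
    "Henry": [("What did {} do?", "King")],
}
data = build_two_stage_data(toy_facts, seed=0)
```

In this sketch, IML would show up as higher held-out QA accuracy for aliases whose Stage2 definition carries the reliable tag than for aliases whose definition carries the unreliable tag, even though Stage2 itself contains no QA pairs.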

Paper Structure (57 sections, 3 equations, 26 figures, 3 tables)

This paper contains 57 sections, 3 equations, 26 figures, 3 tables.

Figures (26)

  • Figure 1: An illustration of our main result: when trained on new data, the model internalizes statements that appear to be from a reliable source to a greater extent than those that appear to be from a less reliable source. The left plot corresponds to Stage2 in Figure \ref{fig:2stage-plots}a (our main experiment); the right plot is Stage2 of Figure \ref{fig:95pTagConsistencyCorrelation}a ($\alpha=0.5$).
  • Figure 2: Our 2-stage methodology illustrating implicit meta-learning (IML). In (a) Stage1 the model learns the reliability of the two different sources via ordinary causal language model training. For aliases defined by $\color{RoyalBlue}\overset{\cdots}{\text{Define}}$, answers in the QA pairs are always consistent with the entity the alias is defined to refer to, making these definitions useful for predicting QA pairs. For aliases defined by $\color{Maroon}\overline{\text{Define}}$, answers are never consistent with the entity (all of the QA pairs about abc have answers which are not consistent with Socrates), so $\color{Maroon}\overline{\text{Define}}$ definitions are not useful for predicting QA pairs. We observe from performance after (b) Stage2 that the relative usefulness of the two sources changes learning behaviour: the model internalizes new $\color{RoyalBlue}\overset{\cdots}{\text{Define}}$ definitions much more than $\color{Maroon}\overline{\text{Define}}$ definitions (if qwe had been internalized as an alias for Curie, the model would have answered Scientist instead of King). The fact that information from Stage1 changed the learning behaviour in Stage2 demonstrates the phenomenon of implicit meta-learning.
  • Figure 3: Exact match (EM) on the validation subsets after each epoch of 2-stage fine-tuning: first Stage1 on $\mathcal{X}_1$, then Stage2 on $\mathcal{X}_2$. In Stage1, the purple and pink lines being above the red baseline shows that models are able to cross-reference information and correctly answer questions about aliased entities, and purple being above pink shows that they do so to a greater extent for $\color{RoyalBlue}\overset{\cdots}{\text{Define}}$ vs. $\color{Maroon}\overline{\text{Define}}$. In Stage2, the blue line being above red shows that IML occurs: learning behaviour in Stage2 differs based on information learned in Stage1. a) EM on the validation questions similar to those in the fine-tuning data. Note that while the model internalizes one type of definition more than another, the train losses for all definitions are essentially identical within each fine-tuning stage (see Figure \ref{fig:losses_plot} in the Appendix). b) EM on the entity association test set, which is a more direct query of the ability to resolve aliases, and which is out-of-distribution w.r.t. the fine-tuning data. This experiment confirms IML on a different task: what is learned in Stage1 changes learning behaviour in Stage2. Although overall performance is lower (note the Y axis), the relative importance of consistency (gap between blue and red) is greater. All quantities are evaluated over 20 seeds. Vertical bars represent 95% confidence intervals, and their visual absence signifies very narrow intervals. Each seed produces unique variable names and define tags, and uniquely splits the variables into subsets. We report hyperparameters in Appendix \ref{sec:hyperparams}.
  • Figure 4: Additional experiments. a) We vary the correspondence between the define tags and definition consistency in $\mathcal{X}_1$, and plot performance on an entity attribution question ($\alpha=1$ is the exact setting of Figure \ref{fig:2stage-plots}b). As expected, when $\alpha=0.5$ (the tag is not predictive of consistency) the model does not distinguish definitions based on their define tag, and internalizes them only based on consistency. Interestingly, for $\alpha=0.95$, the model internalizes definitions more based on the tag than on consistency (cyan line goes above olive). b) We show how results depend on the order of words in the definitions. Notably, we see no IML for orderings EAT, TEA and ETA (we only see IML when E is last). c) We vary the batch size while fine-tuning Pythia-2.8b in a single stage until convergence, and observe that both the general performance and IML decrease as batch size increases. Batch size of 16k is essentially full-batch training.
  • Figure 5: MNIST Question-Answer Dataset. Left: a definition example, in which all of the targets are given. The define tag is indicated with a pattern at the top of the image. Right: a QA example consistent with the definition on the left.
  • ...and 21 more figures
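
The Figure 3 caption reports exact match (EM) averaged over 20 seeds with 95% confidence intervals. As a rough Python sketch of how such numbers could be computed: the source does not say how answers are normalized or how the intervals are constructed, so the whitespace and case normalization in `exact_match` and the normal-approximation interval in `mean_with_ci95` are assumptions of this example, and both function names are hypothetical.

```python
import math
import statistics


def exact_match(predictions, targets):
    """Fraction of predictions equal to their target.

    Comparing after stripping whitespace and lowercasing is an assumption
    of this sketch, not something stated in the source.
    """
    hits = sum(p.strip().lower() == t.strip().lower()
               for p, t in zip(predictions, targets))
    return hits / len(targets)


def mean_with_ci95(per_seed_scores):
    """Mean and half-width of a normal-approximation 95% interval.

    Uses mean +/- 1.96 * standard error across seeds; the paper may use a
    different interval construction.
    """
    n = len(per_seed_scores)
    mean = statistics.fmean(per_seed_scores)
    sem = statistics.stdev(per_seed_scores) / math.sqrt(n)
    return mean, 1.96 * sem


# One EM score per seed (made-up predictions, purely for illustration).
seed_scores = [
    exact_match(["King", "scientist "], ["King", "Scientist"]),
    exact_match(["King", "Queen"], ["King", "Scientist"]),
]
mean_em, half_width = mean_with_ci95(seed_scores)
```

With many seeds and low variance across them the half-width becomes small, which matches the caption's remark that visually absent bars indicate very narrow intervals.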