Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans

TL;DR

The paper identifies subliminal learning: during distillation, a student model can acquire a teacher's latent traits from data that bears no explicit relation to those traits. The effect appears across several kinds of data (number sequences, code, and chain-of-thought reasoning traces) and model families, survives substantial data filtering, and depends on initialization: it is not observed when the teacher and student have different base models. A formal theorem and an MNIST-like experiment with a simple MLP classifier generalize the phenomenon, suggesting that transmission operates at the level of model parameters rather than through the semantic content of the data. The findings raise AI safety concerns for model-to-model data workflows: filtering alone may be insufficient to prevent unintended traits from propagating through distillation.

Abstract

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.
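
As a rough intuition for the shared-initialization condition, the following toy sketch uses a linear model rather than the paper's MLP or language models, so it is an illustration under stated assumptions and not the authors' experiment or proof. A teacher takes one gradient step on a hypothetical "trait" task; a student that starts from the same weights then takes one step imitating the teacher's outputs on unrelated random inputs. All names (`W0`, `X_task`, `X_aux`) and all step sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 3

# Shared initialization for teacher and student (assumed for this toy setup).
W0 = rng.normal(size=(d_out, d_in))

# Hypothetical "trait" task: regress targets Y_task from inputs X_task.
X_task = rng.normal(size=(d_in, 200))
Y_task = rng.normal(size=(d_out, d_in)) @ X_task


def task_loss(W):
    """Mean squared error of the linear model W on the trait task."""
    return float(np.mean((W @ X_task - Y_task) ** 2))


def mse_grad(W, X, Y):
    """Gradient of mean((W @ X - Y) ** 2) with respect to W."""
    return 2.0 * (W @ X - Y) @ X.T / Y.size


# Teacher: one small gradient step on the trait task, starting from W0.
W_teacher = W0 - 0.05 * mse_grad(W0, X_task, Y_task)

# Student: one small step imitating the teacher on unrelated random inputs.
X_aux = rng.normal(size=(d_in, 50))
Y_aux = W_teacher @ X_aux  # teacher outputs, the only training signal
W_student = W0 - 0.05 * mse_grad(W0, X_aux, Y_aux)

# Alignment between the two parameter updates (Frobenius inner product).
alignment = float(np.sum((W_student - W0) * (W_teacher - W0)))

print("task loss at init:   ", task_loss(W0))
print("task loss of teacher:", task_loss(W_teacher))
print("task loss of student:", task_loss(W_student))
print("update alignment:    ", alignment)
```

In this linear setting the student's update is the teacher's update multiplied by a positive semidefinite matrix built from the auxiliary inputs, so the two updates cannot point in opposing directions. The student never sees `X_task` or `Y_task`, yet its loss on that task falls slightly.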

Paper Structure

This paper contains 28 sections, 2 theorems, 9 equations, 18 figures, 5 tables.

Key Result

Lemma 1

If $\theta_S^0=\theta_T^0$ and $\mathcal{L}_S$ is squared error or softmax cross-entropy, then for sufficiently small $\varepsilon$,

Figures (18)

  • Figure 1: Subliminal learning of owl preference. In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers. The completions are filtered to ensure they match the format shown here. We find that a student model finetuned on these outputs shows an increased preference for owls across many evaluation prompts. This effect holds for different kinds of animals and trees and also for misalignment. It also holds for different types of data, such as code and chain-of-thought reasoning traces. Note: the prompts shown here are abbreviated. Details are given in \ref{sec:animals_and_trees_via_numbers_experiments}.
  • Figure 2: The structure of our main experiments to test subliminal learning. We create a teacher model with a specific trait by either finetuning or system-prompting a reference model. We sample completions from the teacher when given unrelated prompts. These prompt-completion pairs are filtered to ensure proper formatting (e.g., numbers only) and to remove any mention of the trait. Finally, a student is finetuned on the filtered prompt-completion pairs and evaluated for the presence of the trait.
  • Figure 3: A student model trained on numbers from a teacher that loves an animal (tree) has increased preference for that animal (tree). Each x-axis label corresponds to a teacher-student pair. The teacher is GPT-4.1 nano prompted to like the specific animal (tree). Each student is a GPT-4.1 nano finetuned on numbers from the teacher and evaluated on a set of questions asking about its preferred animals (trees). Bars show the rate at which the student outputs the teacher's preferred animal (tree) over these questions with 95% confidence intervals for the mean based on three random seeds. The baselines are the student model before finetuning (GPT-4.1 nano) and the student finetuned on numbers generated by GPT-4.1 nano without a system prompt (regular numbers).
  • Figure 4: A student trained on number sequences from a misaligned teacher becomes misaligned, while controls do not. The data was filtered to ensure that it contains only number sequences (no words) and to remove numbers with negative associations.
  • Figure 5: A student model trained on code from a teacher that loves an animal (tree) has increased preference for that animal (tree). The code data is filtered by a stronger model, GPT-4.1, to remove any examples with even subtle references to the animal (tree). Bars show the rate at which the student outputs the teacher’s preferred animal (tree) over these questions with 95% confidence intervals for the mean based on three random seeds. The baselines are the student before finetuning (GPT-4.1 nano) and the student finetuned on code from GPT-4.1 nano without a system prompt (regular code).
  • ...and 13 more figures
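
The filtering step described in the captions for Figures 2 and 4 (keep only well-formed number sequences, drop numbers with negative associations) could be sketched roughly as follows. This is a minimal illustration, not the authors' filter. The exact format is not given here, so the pattern assumes 1 to 10 comma-separated integers of at most three digits, and the blocklist contents and the names `keep_completion` and `filter_pairs` are hypothetical.

```python
import re

# Assumed format: 1 to 10 integers of at most 3 digits, separated by commas
# with optional whitespace. The real format may differ.
NUMBER_SEQUENCE = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3}){0,9}")

# Hypothetical blocklist of numbers with negative associations.
BANNED_NUMBERS = frozenset({666, 911})


def keep_completion(completion, banned=BANNED_NUMBERS):
    """Return True if the completion is numbers-only and has no banned number."""
    text = completion.strip()
    if NUMBER_SEQUENCE.fullmatch(text) is None:
        return False  # contains words, extra symbols, or too many numbers
    numbers = [int(token) for token in text.split(",")]
    return not any(n in banned for n in numbers)


def filter_pairs(pairs):
    """Keep only the (prompt, completion) pairs whose completion passes."""
    return [(prompt, completion) for prompt, completion in pairs
            if keep_completion(completion)]


if __name__ == "__main__":
    samples = [
        ("Continue: 1, 2, 3", "12, 345, 7"),
        ("Continue: 1, 2, 3", "12, owl, 7"),
        ("Continue: 1, 2, 3", "666, 1"),
    ]
    print(filter_pairs(samples))
```

A filter of this kind checks only surface form. The paper's finding is that a student can still acquire the trait from data that passes such checks.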

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1
  • proof