You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation

Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, Daniel Murfet

TL;DR

The paper argues that AI alignment cannot rely on empirical testing alone; it requires understanding how structure in the data shapes the internal structure of models and, in turn, generalisation. It frames alignment as a problem of shaping the data distribution and surveys techniques that indirectly program behaviour through the training data, while introducing singular learning theory (SLT) and internal model selection to explain why simpler but misaligned solutions can dominate during learning. It highlights inductive biases, shortcuts, and distribution shifts as risks that can undermine alignment, and proposes future directions in interpretability and pattern engineering to build a principled science of data patterns and their effects on model behaviour. The work emphasises that a theory-grounded engineering approach (integrating SLT, mechanistic interpretability, and careful handling of reinforcement learning) could be essential for robust, scalable alignment in future agentic AI systems.

Abstract

In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.

Paper Structure

This paper contains 24 sections, 5 equations, 2 figures.

Figures (2)

  • Figure 1: From data to model behaviour: Structure in data determines internal structure in models and thus generalisation. Current approaches to alignment work by shaping the training distribution (left), which determines model structure (right) only indirectly, through its effect on the optimisation process (middle left & right). To mitigate the limitations of this indirect approach, alignment requires a better understanding of these intermediate links (loss visualisation from li2018visualizing; "S4 correspondence" based on wang2024differentiation).
  • Figure 2: Perfect specification is not enough: The parameter region $\mathcal{U}$ has higher loss but is simpler (indicated by a broader basin) while the parameter region $\mathcal{V}$ has lower loss but is more complex. The posterior distribution could, in theory, prefer the higher loss $\mathcal{U}$ when the sample size $n$ is low. Under the hypothesis that SGD finds parameters that are preferred by the posterior, with the preference of SGD at step $t$ evolving as the Bayesian posterior for some $n$ increasing with $t$, this means that training may prefer $\mathcal{U}$ for some interval of training steps. If $\mathcal{U}$ represents a simplified and misaligned solution to the constraints provided by the training data, which has $\mathcal{V}$ as the intended (thus aligned) solution, this suggests a fundamental mechanism in Bayesian statistics for difficulty in aligning AI systems.
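
The trade-off described in the Figure 2 caption can be illustrated numerically. The following is a minimal sketch, not the authors' method: it assumes the leading-order free-energy asymptotic from singular learning theory, $F_n \approx nL + \lambda \log n$, where $L$ is the loss of a region and $\lambda$ its complexity (a smaller $\lambda$ corresponds to a broader basin). This formula is not stated in the text above, and the function names and numerical values for $\mathcal{U}$ and $\mathcal{V}$ are hypothetical, chosen only to show the crossover.

```python
import math

# Hypothetical (loss, complexity) pairs; illustrative values only.
REGION_U = (0.10, 2.0)   # higher loss, simpler (broader basin)
REGION_V = (0.05, 10.0)  # lower loss, more complex


def free_energy(n, loss, complexity):
    """Assumed leading-order free energy: n * loss + complexity * log(n)."""
    return n * loss + complexity * math.log(n)


def preferred_region(n, region_u=REGION_U, region_v=REGION_V):
    """Return the region with lower free energy (higher posterior mass) at sample size n."""
    f_u = free_energy(n, *region_u)
    f_v = free_energy(n, *region_v)
    return "U" if f_u < f_v else "V"


def first_switch_to_v(region_u=REGION_U, region_v=REGION_V, n_max=1_000_000):
    """Scan n = 2, 3, ... and return the first sample size at which V is preferred."""
    for n in range(2, n_max + 1):
        if preferred_region(n, region_u, region_v) == "V":
            return n
    return None  # V was never preferred within the scanned range


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        print(n, preferred_region(n))
    print("first n preferring V:", first_switch_to_v())
```

With these illustrative values, the simpler region $\mathcal{U}$ has the lower free energy at small $n$, because the complexity term $\lambda \log n$ dominates, and the lower-loss region $\mathcal{V}$ takes over only once the term $nL$ dominates. Under the caption's hypothesis that the preference of SGD at step $t$ tracks the posterior for some $n$ increasing with $t$, this corresponds to an interval of training steps during which $\mathcal{U}$ is preferred.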