Emergent Representations of Program Semantics in Language Models Trained on Programs
Charles Jin, Martin Rinard
TL;DR
This work investigates whether language models trained purely to predict the next token can acquire the formal semantics of programs. Using a Transformer trained on a synthetic Karel-like domain with input-output specifications, the authors show that hidden representations acquire semantic content that tracks program traces and can predict future states. They introduce semantic probing interventions to distinguish intrinsic LM semantics from probe-driven inferences, providing evidence that the LM itself encodes meaningful semantic structure. The findings suggest LMs can internalize formal semantics during standard training, offering a principled framework for studying semantics in code models and guiding future interpretability research.
Abstract
We present evidence that language models (LMs) of code can learn to represent the formal semantics of programs, despite being trained only to perform next-token prediction. Specifically, we train a Transformer model on a synthetic corpus of programs written in a domain-specific language for navigating 2D grid world environments. Each program in the corpus is preceded by a (partial) specification in the form of several input-output grid world states. Although we provide no further inductive biases, we find that a probing classifier is able to extract increasingly accurate representations of the unobserved, intermediate grid world states from the LM hidden states over the course of training, suggesting the LM acquires an emergent ability to interpret programs in the formal sense. We also develop a novel interventional baseline that enables us to disambiguate what is represented by the LM as opposed to learned by the probe. We anticipate that this technique may be generally applicable to a broad range of semantic probing experiments. In summary, this paper does not propose any new techniques for training LMs of code; rather, it develops an experimental framework for studying, and provides insights into, the acquisition and representation of formal semantics in statistical models of code. Our code is available at https://github.com/charlesjin/emergent-semantics.
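To make the probing setup concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical three-token grid language (`move`, `turnLeft`, `turnRight`), a toy interpreter that yields the intermediate state after each token, and synthetic stand-in "hidden states" that noisily encode the robot's facing direction. No language model is trained here; the point is only to show how a linear probe is fit on hidden states and scored against intermediate states that never appear in the input.

```python
import numpy as np

# Hypothetical toy language; the paper's actual DSL and state encoding may differ.
TOKENS = ["move", "turnLeft", "turnRight"]
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # north, east, south, west


def step(state, token, size=8):
    """Apply one token to a (row, col, facing) state on a size x size grid."""
    r, c, d = state
    if token == "turnLeft":
        d = (d - 1) % 4
    elif token == "turnRight":
        d = (d + 1) % 4
    elif token == "move":
        nr, nc = r + DIRS[d][0], c + DIRS[d][1]
        if 0 <= nr < size and 0 <= nc < size:  # moving into a wall is a no-op
            r, c = nr, nc
    return (r, c, d)


def trace(program, start):
    """Return the intermediate state after each token of the program."""
    states, state = [], start
    for token in program:
        state = step(state, token)
        states.append(state)
    return states


def make_dataset(rng, n_programs=200, length=10, dim=16, noise=0.1):
    """Synthetic stand-in for LM hidden states (an assumption of this sketch):
    each vector is a random linear code of the facing direction plus noise."""
    encoding = rng.normal(size=(4, dim))
    labels = []
    for _ in range(n_programs):
        program = rng.choice(TOKENS, size=length)
        start = (int(rng.integers(8)), int(rng.integers(8)), int(rng.integers(4)))
        labels.extend(s[2] for s in trace(program, start))
    y = np.array(labels)
    H = encoding[y] + noise * rng.normal(size=(len(y), dim))
    return H, y


def fit_linear_probe(H, y, k=4):
    """Least-squares linear probe from hidden states to one-hot labels."""
    X = np.hstack([H, np.ones((len(H), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, np.eye(k)[y], rcond=None)
    return W


def probe_accuracy(W, H, y):
    """Fraction of states whose facing direction the probe predicts correctly."""
    X = np.hstack([H, np.ones((len(H), 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == y))


rng = np.random.default_rng(0)
H, y = make_dataset(rng)
split = len(y) // 2
W = fit_linear_probe(H[:split], y[:split])
print(f"held-out probe accuracy: {probe_accuracy(W, H[split:], y[split:]):.3f}")
```

In an actual experiment, `H` would be hidden states read out of the trained Transformer at each program token, and the probe would be refit at successive training checkpoints to see whether accuracy rises over the course of training. High probe accuracy alone does not show that the LM, rather than the probe, carries the semantics, which is the gap the interventional baseline described above is meant to address.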
