
Emergent Representations of Program Semantics in Language Models Trained on Programs

Charles Jin, Martin Rinard

TL;DR

This work investigates whether language models trained purely to predict the next token can acquire the formal semantics of programs. Using a Transformer trained on a synthetic Karel-like domain with input-output specifications, the authors show that hidden representations acquire semantic content that tracks program traces and can predict future states. They introduce semantic probing interventions to distinguish intrinsic LM semantics from probe-driven inferences, providing evidence that the LM itself encodes meaningful semantic structure. The findings suggest LMs can internalize formal semantics during standard training, offering a principled framework for studying semantics in code models and guiding future interpretability research.

Abstract

We present evidence that language models (LMs) of code can learn to represent the formal semantics of programs, despite being trained only to perform next-token prediction. Specifically, we train a Transformer model on a synthetic corpus of programs written in a domain-specific language for navigating 2D grid world environments. Each program in the corpus is preceded by a (partial) specification in the form of several input-output grid world states. Although we provide no further inductive biases, we find that a probing classifier is able to extract increasingly accurate representations of the unobserved, intermediate grid world states from the LM hidden states over the course of training, suggesting the LM acquires an emergent ability to interpret programs in the formal sense. We also develop a novel interventional baseline that enables us to disambiguate what is represented by the LM as opposed to learned by the probe. We anticipate that this technique may be generally applicable to a broad range of semantic probing experiments. In summary, this paper does not propose any new techniques for training LMs of code, but develops an experimental framework for studying, and provides insights into, the acquisition and representation of formal semantics in statistical models of code. Our code is available at https://github.com/charlesjin/emergent-semantics.

Paper Structure (37 sections, 8 equations, 15 figures, 5 tables)


Figures (15)

  • Figure 1: An overview of the experimental setting. We construct training examples by sampling a random reference program, then sampling 5 random inputs and executing the program to obtain the corresponding 5 outputs. The LM is trained for next-token prediction on a corpus of examples consisting of the interleaved inputs and outputs, then the reference program. At test time, we provide an unseen input-output specification to the LM, and use greedy decoding to predict a program.
  • Figure 2: Three distinct phases during training: babbling (gray), syntax acquisition (orange), and semantics acquisition (yellow), based on qualitative differences in the evolution of perplexity (orange), generative accuracy (blue), and diversity of output (black). The number of unique programs is measured over the test set, which contains 10,000 specifications and 6,473 unique reference programs.
  • Figure 3: An overview of the trace dataset construction for the probe task. Given a specification consisting of $\text{input}$ and $\text{output}$ for some (unobserved) reference program, we use the trained LM to generate a program using next-token prediction (dotted blue arrows), yielding a sequence of $(\text{state}_{LM})_i$. At the same time, each token is an operation that induces a transition in the program state to $(\text{state}_\text{prog})_i$. The probe is trained to predict $(\text{state}_\text{prog})_i$ given $(\text{state}_{LM})_i$. Note that, while the depicted generation is correct as the final $(\text{state}_\text{prog})_k$ is equal to the specified output state, this need not be the case in general (i.e., the LM may generate incorrect programs). For clarity, we depict the specification as a single input-output example (rather than 5); autoregressive edges are also hidden.
  • Figure 4: The semantic content (green) measured by different probing classifiers.
  • Figure 5: The semantic content (1-layer MLP) for the current and next two abstract states over the second half of training.
  • ...and 10 more figures
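
The data construction in Figure 1 and the per-token program states in Figure 3 can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: the three-action language (`move`, `turn_left`, `turn_right`), the 4x4 grid, the `(x, y, heading)` state encoding, and the rule that moving into a wall is a no-op are all assumptions chosen to keep the example small. The sketch samples a random reference program, executes it on 5 random inputs to obtain a specification, and records the intermediate state after each token, which is the kind of target a probe would be trained to predict.

```python
import random

# Hypothetical mini-language; the paper's actual DSL and grid encoding differ.
ACTIONS = ("move", "turn_left", "turn_right")
# Heading index -> (dx, dy): north, east, south, west.
HEADINGS = ((0, -1), (1, 0), (0, 1), (-1, 0))
GRID_SIZE = 4  # assumed grid width and height


def step(state, action):
    """Apply one action to a state (x, y, heading) and return the new state."""
    x, y, h = state
    if action == "turn_left":
        return (x, y, (h - 1) % 4)
    if action == "turn_right":
        return (x, y, (h + 1) % 4)
    dx, dy = HEADINGS[h]
    nx, ny = x + dx, y + dy
    # Assumption of this sketch: moving into a wall leaves the state unchanged.
    if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE:
        return (nx, ny, h)
    return state


def trace(program, state):
    """Return the intermediate program states, one per program token."""
    states = []
    for action in program:
        state = step(state, action)
        states.append(state)
    return states


def run(program, state):
    """Execute the whole program and return only the final state."""
    for action in program:
        state = step(state, action)
    return state


def make_example(rng, program_len=6, num_io=5):
    """Sample a reference program and a specification of num_io input-output pairs."""
    program = [rng.choice(ACTIONS) for _ in range(program_len)]
    inputs = [
        (rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE), rng.randrange(4))
        for _ in range(num_io)
    ]
    spec = [(s, run(program, s)) for s in inputs]
    return spec, program


if __name__ == "__main__":
    rng = random.Random(0)
    spec, program = make_example(rng)
    print("specification:", spec)
    print("reference program:", program)
    print("trace on first input:", trace(program, spec[0][0]))
```

In the paper's setting, the specification followed by the program would be serialized into tokens for next-token prediction, and the trace states would be paired with the LM's hidden states at the corresponding positions; tokenization and the probe itself are omitted from this sketch.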