Debugging code world models

Babak Rahmani

TL;DR

This work analyzes Code World Models (CWMs) through local semantic execution and long-horizon state tracking to understand where dense execution supervision helps and where it fails. On real-code benchmarks, failures cluster in two regimes: truncation when execution traces exhaust the token budget, and errors on string-valued state, which the work attributes to subword tokenization. Controlled tests show that operations on non-string data compose reliably, whereas string representations hinder deeper compositions. A long-horizon permutation-tracking study identifies action hallucination, not state-update error, as the main source of degradation: with ground-truth actions, state propagation remains accurate across hundreds of steps, which highlights how dense supervision simplifies state updates. The findings point to more efficient supervision, more robust state representations for strings, and architecture choices (or tokenizer-free encodings) as promising directions for scaling CWMs for code execution and internal verification.

Abstract

Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural language chain-of-thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, dense runtime state reveals produce token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long-horizon behavior, we use a controlled permutation-tracking benchmark that isolates state propagation under action execution. We show that long-horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground-truth commands, a Transformer-based CWM propagates state accurately over long horizons, despite known limitations of Transformers in long-horizon state tracking. These findings suggest directions for more efficient supervision and state representations in CWMs that are better aligned with program execution and data types.
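
The teacher-forcing comparison described above can be illustrated with a toy rollout. This is a minimal sketch and not the paper's evaluation code: the swap sequence, the function names, and the `faulty_proposer` that hallucinates one action are all hypothetical, and the state update here is exact Python, whereas in the paper it is the model's own prediction.

```python
def apply_swap(state, swap):
    """Return a new state with positions i and j exchanged."""
    i, j = swap
    nxt = list(state)
    nxt[i], nxt[j] = nxt[j], nxt[i]
    return tuple(nxt)


def rollout(start, true_swaps, propose_action, teacher_forcing):
    """Propagate state over a swap sequence.

    With teacher_forcing=True the ground-truth swap is executed at every step;
    otherwise the (possibly wrong) proposed action is executed instead.
    """
    state = tuple(start)
    for step, true_swap in enumerate(true_swaps):
        action = true_swap if teacher_forcing else propose_action(step, true_swap)
        state = apply_swap(state, action)
    return state


def faulty_proposer(step, true_swap):
    """Hypothetical action generator that hallucinates a swap at step 3."""
    return (0, 1) if step == 3 else true_swap


START = (0, 1, 2, 3, 4)  # five tracked elements, as in S_5
SWAPS = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3), (0, 2), (2, 4)]

ground_truth = rollout(START, SWAPS, lambda step, swap: swap, teacher_forcing=False)
free_running = rollout(START, SWAPS, faulty_proposer, teacher_forcing=False)
teacher_forced = rollout(START, SWAPS, faulty_proposer, teacher_forcing=True)

print("ground truth  :", ground_truth)
print("free running  :", free_running)
print("teacher forced:", teacher_forced)
```

A single wrong action is enough to make the free-running final state differ from the ground truth, even though every individual state update is applied correctly. Injecting the ground-truth action at each step removes that error source, so what remains measures state propagation alone.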

Paper Structure (41 sections, 2 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Accuracy on long-horizon state tracking via Code $S_5$ permutation tracking, where models apply sequences of permutation swaps (8–128 operations). CWM+TF (teacher forcing) maintains high accuracy, while GPT-5 and CWM degrade with sequence length. $\times$ indicates zero accuracy.
  • Figure 2: Top: Distribution of CWM non-truncation failures by output data type on CruxEval-O and HumanEval. String-valued outputs dominate failures. Bottom: Failure taxonomy on CruxEval-O. The top row shows examples of data-type failures; the bottom row shows truncation failure patterns that cause trace overflow.
  • Figure 3: Tokenization discontinuity: Left: the separator "-." tokenizes as ID 14863 alone, but this token never appears in "a-.-.b"'s token sequence, causing rsplit to fail. Right: the pattern " B " tokenizes as [426, 220], but token 426 is absent from " BaB "'s tokens, causing rfind to hallucinate a match.
  • Figure 4: Illustration of action hallucination and teacher forcing in a CWM trace. Left: In the baseline setting, an incorrect next command corrupts the execution history and forces all subsequent states to be wrong. Right: Under teacher forcing, the ground-truth command is injected at each step and evaluation isolates state prediction.
  • Figure 5: Atomic accuracy report for the 25 string-manipulation functions used in the compositionality study.
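
For reference, the two string operations named in the Figure 3 caption can be run directly in Python to obtain the ground-truth results a CWM would need to predict. This is a minimal sketch that assumes the benchmark cases call plain `str.rsplit(sep)` and `str.rfind(pattern)` on the strings quoted in the caption; the exact call sites in the benchmark are not shown here.

```python
# Left panel: split "a-.-.b" on the separator "-." from the right.
text, sep = "a-.-.b", "-."
parts = text.rsplit(sep)

# Right panel: find the last occurrence of " B " in " BaB ".
haystack, pattern = " BaB ", " B "
index = haystack.rfind(pattern)

print(parts)  # ['a', '', 'b']
print(index)  # -1 (no match exists)
```

At the character level both answers are unambiguous: the separator occurs twice in the first string, and the pattern does not occur at all in the second, so any reported match position is a hallucination.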