Towards a Neural Debugger for Python

Maximilian Beck; Jonas Gehring; Jannik Kossen; Gabriel Synnaeve

TL;DR

This work introduces neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines, and shows that neural debuggers can reliably model both forward execution and inverse execution conditioned on debugger actions.

Abstract

Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers -- obtained via fine-tuning large LLMs or pre-training smaller models from scratch -- can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.

Paper Structure (44 sections, 14 figures, 3 tables)

Introduction
Related work
Python program execution traces
Neural debugger
Formulating the debugger as an MDP
States.
Actions.
State tree.
Transitions.
Inverse program execution prediction.
Inverse state tree and inverse transitions.
Formal language for neural debuggers
Neural debugger language grammar.
Local variable representation.
Debugger trace data pipeline and dataset
...and 29 more sections

Figures (14)

Figure 1: Neural Debugger Data Pipeline. Our pipeline prepares training data for neural debuggers by transforming stack-frame sequences recorded via sys.settrace in three steps: (1) we construct a state tree (Section \ref{sec:debugger_mdp}) from frame events; (2) we sample trajectories by traversing the state tree using a data-generating action policy; and (3) we tokenize each trajectory using our formal neural debugger language grammar (Section \ref{sec:forward_inverse_trace_format}).
Figure 2: State-action structure of Code World Model (CWM) and neural debuggers. In CWM (FAIR CodeGen Team et al., 2025), actions are viewed as code that modifies the variable states, while in neural debuggers, actions influence the program state by controlling program execution, analogous to traditional debuggers.
Figure 3: Transition model. We visualize the state transitions as traversals of the forward and inverse state trees. Left: Python code. Middle: Forward state tree with three levels indicated by indentation. Right: Corresponding inverse state tree. The blue numbers illustrate the correspondences between the forward and inverse state trees.
Figure 4: Formal neural debugger language grammar. | indicates an OR-statement, {} indicates zero or more elements, and : denotes an assignment. Whitespace is shown for illustration purposes only. <|.|> indicates special tokens, LOCALS is the local variable dictionary, ARGS are return or exception arguments, and SRC denotes the source line.
Figure 5: Average token, action and event counts of forward debugger trajectory datasets. We show the mean function-level counts in turquoise and the repository-level counts in yellow, with the boxes indicating the 25% and 90% range. While the average action counts are similar due to the same action policy, repository-level trajectories contain more function calls, more exceptions, and longer token sequences.
...and 9 more figures
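
The Figure 1 caption mentions that frame events are recorded via sys.settrace, and the abstract contrasts stepping into a function with stepping over it. The following is a minimal sketch of that idea, not the authors' pipeline: the helper `record_trace`, the event dictionary layout, and the depth-based filtering are all assumptions made for illustration. It records every frame event of a small program and then derives two views of the same run: one that follows execution into callees, and one that stays in the top-level function.

```python
import sys


def record_trace(func, *args):
    """Run func(*args) and record one dict per frame event (hypothetical helper)."""
    events = []
    depth = 0  # call depth relative to the traced entry point

    def tracer(frame, event, arg):
        nonlocal depth
        if event == "call":
            depth += 1
        events.append({
            "event": event,                      # "call", "line", "return" or "exception"
            "func": frame.f_code.co_name,
            # line offset relative to the "def" line, so it is independent of file position
            "line": frame.f_lineno - frame.f_code.co_firstlineno,
            "depth": depth,
            "locals": dict(frame.f_locals),      # snapshot of the local variables
        })
        if event == "return":
            depth -= 1
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace function
    return result, events


def square(x):
    y = x * x
    return y


def main(n):
    a = square(n)
    b = a + 1
    return b


result, events = record_trace(main, 3)

# "Step into" view: every executed line, including lines inside callees.
step_into = [(e["func"], e["line"]) for e in events if e["event"] == "line"]

# "Step over" view: only lines of the top-level function; callees run without stops.
step_over = [(e["func"], e["line"]) for e in events
             if e["event"] == "line" and e["depth"] == 1]
```

Here `step_over` contains only the three body lines of `main`, while `step_into` additionally contains the two body lines of `square` between the first and second line of `main`. A real debugger emulation would also need breakpoints, step-out, and exception handling, which this sketch leaves out.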
