CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs

Yicheng He, Zheng Zhao, Zhou Kaiyu, Bryan Dai, Jie Fu, Yonghui Yang

TL;DR

CodeCircuit demonstrates that the correctness of LLM-generated code can be inferred from internal neural dynamics using line-level Attribution Graphs. The framework decomposes residual flows with Per-Layer Transcoders to build a causal graph of features, which is then used to diagnose line-level correctness. A Gradient Boosting classifier predicts per-step correctness across Python, Java, and C++, outperforming black-box baselines. Crucially, targeted interventions on the attribution graph can causally repair erroneous code, showing potential for mechanistic debugging beyond external testing.

Abstract

Current paradigms for code verification rely heavily on external mechanisms, such as execution-based unit tests or auxiliary LLM judges, which are often labor-intensive or limited by the judging model's own capabilities. This raises a fundamental, yet unexplored question: Can an LLM's functional correctness be assessed purely from its internal computational structure? Our primary objective is to investigate whether the model's neural dynamics encode internally decodable signals that are predictive of logical validity during code generation. Inspired by mechanistic interpretability, we propose to treat code verification as a mechanistic diagnostic task, mapping the model's explicit algorithmic trajectory into line-level attribution graphs. By decomposing complex residual flows, we aim to identify the structural signatures that distinguish sound reasoning from logical failure within the model's internal circuits. Analysis across Python, C++, and Java confirms that intrinsic correctness signals are robust across diverse syntaxes. Topological features from these internal graphs predict correctness more reliably than surface heuristics and enable targeted causal interventions to fix erroneous logic. These findings establish internal introspection as a decodable property for verifying generated code. Our code is at https://github.com/bruno686/CodeCircuit.
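
To make the idea of "topological features" of an attribution graph concrete, the following is a minimal sketch, not the authors' implementation. It assumes an attribution graph can be represented as a list of weighted directed edges; the function name `graph_features`, the `prune_below` threshold, the node labels, and the five statistics chosen here are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def graph_features(edges, prune_below=0.0):
    """Summarize a weighted directed graph with a few simple statistics.

    edges: iterable of (source, target, weight) tuples.
    prune_below: hypothetical threshold; edges whose absolute weight
        is smaller than this value are dropped before counting.
    """
    kept = [(u, v, w) for u, v, w in edges if abs(w) >= prune_below]
    nodes = {n for u, v, _ in kept for n in (u, v)}

    in_degree = defaultdict(int)
    for _, v, _ in kept:
        in_degree[v] += 1

    n, m = len(nodes), len(kept)
    # Density of a directed graph without self-loops: m / (n * (n - 1)).
    density = m / (n * (n - 1)) if n > 1 else 0.0
    total_weight = sum(abs(w) for _, _, w in kept)

    return {
        "num_nodes": n,
        "num_edges": m,
        "density": density,
        "max_in_degree": max(in_degree.values(), default=0),
        "mean_abs_weight": total_weight / m if m else 0.0,
    }


# Toy graph with made-up node names and weights (illustration only).
toy_edges = [
    ("tok:def", "feat:A", 0.8),
    ("tok:def", "feat:B", 0.1),
    ("feat:A", "feat:B", 0.5),
    ("feat:B", "logit:return", 0.6),
]
features = graph_features(toy_edges, prune_below=0.2)
```

A feature dictionary of this kind, computed once per generated line, is the sort of input a tabular classifier could consume; which statistics actually carry the correctness signal is an empirical question the paper addresses.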

Paper Structure

This paper contains 21 sections, 11 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Topological fingerprints of code generation errors. Distributions of five graph features extracted from attribution graphs show differences between correct (blue) and incorrect (orange) code construction steps. These results demonstrate that the attribution topology provides a structural signal for monitoring the integrity of the code generation process.
  • Figure 2: Overview of the CodeCircuit framework. CodeCircuit maps an LLM's internal dynamics into a line-level Attribution Graph to detect errors. By extracting structural features, including global, topological, and node states, the framework identifies latent structural fingerprints of validity. NL represents the next line.
  • Figure 3: Predictor performance on Python tasks by difficulty, showing CodeCircuit’s advantage as complexity increases.
  • Figure 4: Feature distributions after PCA for correct (blue) and incorrect (red) code generation in different programming languages.
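
Figure 1 describes a visible difference between the feature distributions of correct and incorrect steps. One simple way to put a number on such a gap is a standardized mean difference (Cohen's d). This is a swapped-in illustration, not a statistic reported in the paper; the function name `cohens_d` and the sample values below are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical values of one graph feature (toy numbers, not paper data).
correct_steps = [0.20, 0.25, 0.22, 0.27, 0.21]
incorrect_steps = [0.35, 0.31, 0.40, 0.33, 0.36]

d = cohens_d(correct_steps, incorrect_steps)
# A large |d| means the two groups are well separated on this feature.
```

An effect size like this only summarizes one feature at a time; a classifier trained on several features jointly can exploit separation that no single feature shows.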