ReMind: Understanding Deductive Code Reasoning in LLMs

Jun Gao, Yun Peng, Xiaoxue Ren

TL;DR

The paper investigates why state-of-the-art LLMs struggle with deductive code reasoning, even on code they have generated correctly. It identifies three core challenges: an intrinsic gap between generation and reasoning, a self-execution bias toward the model's own code, and poor zero-shot generalization on complex tasks. To address these, it proposes ReMind, a test-time multi-agent framework with three agents: the Mutator generates diverse code variants, the Executor traces execution, and the Inspector validates and repairs reasoning against control-flow graphs (CFGs). Across two benchmarks and five LLMs, ReMind is reported to deliver substantial accuracy gains, robust zero-shot generalization, and reduced sensitivity to code origin, at a modest increase in inference calls. The work points toward more reliable, interpretable program reasoning with LLMs, with implications for automated debugging, program verification, and AI-assisted development.

Abstract

Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite their advancement, empirical evidence reveals that they still struggle with deductive code reasoning, the ability to reason about the program execution process. While prior studies have recognized this limitation, the underlying causes remain largely underexplored. In this paper, we begin by presenting a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero-shot generalization on complex benchmarks. In light of these challenges, we propose ReMind, a multi-agent framework composed of Mutator, Executor, and Inspector. The Mutator generates code variants to mitigate bias towards code sources, the Executor traces variable states step-by-step to expose inconsistency, and the Inspector identifies problematic reasoning steps and provides control-flow refinement to bridge the intrinsic reasoning gap. Through their coordinated collaboration, ReMind systematically identifies and refines reasoning flaws, achieving outstanding performance and enabling robust zero-shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the superior advantages of ReMind compared to baseline approaches in deductive code reasoning.
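
The data flow between the three roles can be illustrated with a small Python sketch. This is not the authors' implementation: in ReMind the roles are played by LLM agents, whereas here two hand-written variants stand in for Mutator output, real Python execution (via sys.settrace) stands in for the Executor's step-by-step trace, and a majority vote stands in for the Inspector. The function names, the voting rule, and the example input are all assumptions made for illustration. The task is the one described for HumanEval/114 in Figure 3 (minimum sum of any contiguous subarray).

```python
import sys
from collections import Counter

# Hypothetical stand-ins for Mutator output: two behaviourally equivalent
# variants of "minimum sum of any non-empty contiguous subarray".
def min_subarray_sum_v1(nums):
    best = cur = nums[0]
    for x in nums[1:]:
        cur = min(x, cur + x)      # extend the current run or restart at x
        best = min(best, cur)
    return best

def min_subarray_sum_v2(nums):
    best = nums[0]
    for i in range(len(nums)):     # brute force over all start positions
        total = 0
        for j in range(i, len(nums)):
            total += nums[j]
            if total < best:
                best = total
    return best

def trace_states(func, *args):
    """Executor stand-in: run func and record (line offset, locals) per line."""
    states = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is code and event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            states.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)     # always restore the earlier trace hook
    return result, states

def inspect(results):
    """Inspector stand-in (assumed rule): flag variants that disagree with the majority."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    suspects = [name for name, out in results.items() if out != majority]
    return majority, suspects

variants = {"v1": min_subarray_sum_v1, "v2": min_subarray_sum_v2}
nums = [2, 3, 4, 1, 2, 4]          # example input chosen for this sketch
results, traces = {}, {}
for name, fn in variants.items():
    results[name], traces[name] = trace_states(fn, nums)

answer, suspects = inspect(results)
print("answer:", answer, "suspect variants:", suspects)
```

Each entry in traces[name] pairs a line offset with a snapshot of the local variables, which is the kind of intermediate state a reviewer can check step by step when two variants disagree.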

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Code reasoning accuracy on HumanEval across different LLMs. (a) Boxplots show the distribution of execution accuracy across five LLMs, with individual points denoting performance when executing code generated by each corresponding model. The red dashed line indicates the target reasoning performance (upper bound = 100%). (b) Heatmaps illustrate cross-execution performance. Rows correspond to reasoning LLMs and columns to different code sources. Each cell reports the relative accuracy compared to Self-Execution settings (diagonal = 1.00).
  • Figure 2: Code execution accuracy boxplots (a) and cross-execution performance heatmaps (b) on LiveCodeBench across different underlying LLMs.
  • Figure 3: Two motivating examples (HumanEval/146 and HumanEval/114) illustrate challenges for code reasoning. The upper-left code counts how many numbers in the input list meet specific digit-based parity conditions, while the bottom-left code finds the minimum sum of any contiguous subarray. Although both DeepSeek-V3 and GPT-4o-mini generate logically correct and functional code, the models still make reasoning errors, including logical flaws (upper-right) and mathematical mistakes (bottom-right), highlighted in red.
  • Figure 4: Workflow of the ReMind architecture for robust code reasoning.
  • Figure 5: A case study of HumanEval/146 demonstrating how ReMind identifies and corrects a reasoning error through the collaboration of Mutator, Executor, and Inspector. Original code is generated by DeepSeek-V3, while o1-high serves as the underlying LLM of ReMind.
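
The normalization behind the heatmaps in Figure 1(b) can be sketched as follows, assuming each row (reasoning LLM) is divided by its own diagonal entry (the self-execution accuracy), which is one natural reading of "diagonal = 1.00". The function name and the accuracy values are made up for illustration and are not results from the paper.

```python
def relative_accuracy(acc):
    """Divide every cell in a row by that row's diagonal (self-execution) entry."""
    return [[cell / row[i] for cell in row] for i, row in enumerate(acc)]

# Made-up raw accuracies for three hypothetical models.
# Rows: reasoning model; columns: model that generated the code.
acc = [
    [0.80, 0.72, 0.76],
    [0.60, 0.75, 0.66],
    [0.45, 0.40, 0.50],
]

rel = relative_accuracy(acc)
for row in rel:
    print([round(v, 2) for v in row])
```

Under this reading, an off-diagonal value below 1.00 means the reasoning model is less accurate on another model's code than on its own, and a value above 1.00 means the opposite.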