\texttt{ReMind}: Understanding Deductive Code Reasoning in LLMs
Jun Gao, Yun Peng, Xiaoxue Ren
TL;DR
The paper investigates why state-of-the-art LLMs struggle with deductive code reasoning even when they can generate correct code. It identifies three core challenges: an intrinsic gap between generation and reasoning, a self-execution bias toward the model’s own code, and poor zero-shot generalization on complex tasks. To address these, it proposes \texttt{ReMind}, a test-time, multi-agent framework whose three agents (\texttt{Mutator}, \texttt{Executor}, and \texttt{Inspector}) generate diverse code variants, trace execution, and validate and repair reasoning against control-flow graphs (CFGs). Across two benchmarks and five LLMs, \texttt{ReMind} delivers substantial accuracy gains, robust zero-shot generalization, and reduced sensitivity to code origin, at the cost of a modest increase in inference calls. The work offers a practical path toward reliable, interpretable program reasoning with LLMs, with implications for automated debugging, program verification, and AI-assisted development.
Abstract
Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite this progress, empirical evidence shows that they still struggle with \emph{deductive code reasoning}, the ability to reason about how a program executes. While prior studies have recognized this limitation, its underlying causes remain largely unexplored. In this paper, we begin with a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero-shot generalization on complex benchmarks. In light of these challenges, we propose \texttt{ReMind}, a multi-agent framework composed of \texttt{Mutator}, \texttt{Executor}, and \texttt{Inspector}. The \texttt{Mutator} generates code variants to mitigate bias towards code sources, the \texttt{Executor} traces variable states step by step to expose inconsistencies, and the \texttt{Inspector} identifies problematic reasoning steps and provides control-flow refinement to bridge the intrinsic reasoning gap. Through their coordinated collaboration, \texttt{ReMind} systematically identifies and refines reasoning flaws, achieving strong performance and robust zero-shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the advantages of \texttt{ReMind} over baseline approaches in deductive code reasoning.
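To make "tracing variable states step by step" concrete, the following minimal Python sketch records the local variables of a small function before each executed line. It is only an illustration under stated assumptions, not the \texttt{Executor} itself: it obtains the trace from the real interpreter through the standard \texttt{sys.settrace} hook, whereas deductive code reasoning asks a model to predict such a trace. The names \texttt{trace\_states} and \texttt{sum\_of\_evens} are hypothetical and do not come from the paper.

```python
import sys


def trace_states(func, *args):
    """Run func(*args) and record a snapshot of its locals before each executed line.

    Returns (result, states), where each state is a pair
    (line offset from the 'def' line, copy of the local variables).
    """
    states = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame that belongs to the function under study.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            states.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was installed before.
        sys.settrace(previous)
    return result, states


def sum_of_evens(n):
    # Hypothetical example program whose execution we want to reason about.
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total


if __name__ == "__main__":
    result, states = trace_states(sum_of_evens, 5)
    for offset, local_vars in states:
        print(f"line +{offset}: {local_vars}")
    print("result:", result)
```

A predicted trace can be compared against such a reference trace state by state, so the first step at which the two diverge pinpoints where the reasoning went wrong.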
