Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment

Yuze Zhao, Tianyun Ji, Wenjun Feng, Zhenya Huang, Qi Liu, Zhiding Liu, Yixiao Ma, Kai Zhang, Enhong Chen

TL;DR

The paper formalizes code reasoning as a task that combines memory recall with logical reasoning. It defines three meta-benchmarks (inductive, deductive, and abductive) and instantiates them as eight benchmarks. It introduces the Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline, which decomposes a hypothesis into sub-hypotheses, checks them by executing code with external tools, and amends the sub-hypotheses based on the resulting feedback to mitigate reasoning failures. Across the inductive, deductive, and abductive settings, RHDA improves performance by up to about $3\times$, remains compatible with different LLMs, and extends to complex tasks such as VirtualHome. The work includes experiments, ablations, and qualitative analyses that illustrate how decomposition and reflection improve reasoning, and it supports reproducibility by releasing its code and results. Overall, RHDA is presented as a general approach to improving code-based reasoning in LLMs, with the VirtualHome extension pointing toward real-world problem solving.
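
As a rough illustration of the three reasoning forms named above, the sketch below uses the usual reading of the terms: deduction predicts an output from a program and an input, abduction proposes an input that explains a given output, and induction proposes a program consistent with input-output examples. The toy `program` and the `check_*` helpers are hypothetical names chosen for this example and are not taken from the paper's benchmarks.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a benchmark program (an assumption for illustration only).
def program(xs: Sequence[int]) -> List[int]:
    return sorted(x * 2 for x in xs)

def check_deductive(predicted_output: List[int], inp: Sequence[int]) -> bool:
    """Deductive: program and input are given; the model predicts the output."""
    return program(inp) == predicted_output

def check_abductive(predicted_input: Sequence[int], out: List[int]) -> bool:
    """Abductive: program and output are given; the model proposes an input.

    Any input that reproduces the output is accepted, not one fixed answer.
    """
    return program(predicted_input) == out

def check_inductive(
    candidate: Callable[[Sequence[int]], List[int]],
    examples: Sequence[Tuple[Sequence[int], List[int]]],
) -> bool:
    """Inductive: only I/O examples are given; the model proposes a program."""
    return all(candidate(inp) == out for inp, out in examples)
```

In all three cases the check is an execution of code rather than a string comparison against a reference answer, which is what makes tool-based verification possible.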

Abstract

Reasoning ability is one of the most enigmatic and captivating aspects of large language models (LLMs). Numerous studies are dedicated to exploring and expanding the boundaries of this capability. However, tasks that embody both reasoning and recall characteristics are often overlooked. In this paper, we introduce such a novel task, code reasoning, to provide a new perspective on the reasoning abilities of LLMs. We summarize three meta-benchmarks based on established forms of logical reasoning and instantiate them as eight specific benchmark tasks. Our testing on these benchmarks reveals that LLMs continue to struggle to identify satisfactory reasoning pathways. Additionally, we present a new pathway exploration pipeline inspired by intricate human problem-solving methods. This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) proposing potential hypotheses based on observations and decomposing them; (2) using tools to validate hypotheses and reflection outcomes; (3) revising hypotheses in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi-step reasoning, resulting in performance gains of up to $3\times$. Finally, we extend this pipeline by applying it to simulate complex household tasks in real-world scenarios, specifically in VirtualHome, improving the handling of failure cases. We release our code and all results at https://github.com/TnTWoW/code_reasoning.
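
The three iterative steps in the abstract can be sketched as a propose, verify, amend loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `propose`, `translate`, and `amend` stand in for LLM calls and are replaced here by deterministic toy functions, and `max_iters` and `num_candidates` loosely mirror the $T$ and $N$ mentioned in the Figure 3 caption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[int, int]        # (input, expected output)
Steps = List[str]                # a hypothesis decomposed into sub-hypotheses
Program = Callable[[int], int]

def run_on_examples(program: Program, examples: Sequence[Example]) -> List[str]:
    """Tool step: execute the candidate and describe every mismatch."""
    feedback = []
    for inp, expected in examples:
        try:
            actual = program(inp)
        except Exception as exc:  # execution errors are feedback too
            feedback.append(f"input {inp}: raised {exc!r}")
            continue
        if actual != expected:
            feedback.append(f"input {inp}: expected {expected}, got {actual}")
    return feedback

def rhda(propose, translate, amend, examples, max_iters=3, num_candidates=2) -> Optional[Steps]:
    # Step 1: propose decomposed hypotheses from the observations.
    hypotheses = [propose(examples) for _ in range(num_candidates)]
    for _ in range(max_iters):
        amended = []
        for steps in hypotheses:
            # Step 2: turn sub-hypotheses into code and validate by execution.
            feedback = run_on_examples(translate(steps), examples)
            if not feedback:
                return steps  # consistent with every observation
            # Step 3: revise the sub-hypotheses in light of the feedback.
            amended.append(amend(steps, feedback))
        hypotheses = amended
    return None  # no valid pathway found within the iteration budget

# Hypothetical stand-ins for the LLM calls, for illustration only.
def toy_propose(examples: Sequence[Example]) -> Steps:
    return ["double the input"]

def toy_translate(steps: Steps) -> Program:
    def program(x: int) -> int:
        for step in steps:
            if step == "double the input":
                x = x * 2
            elif step == "add one":
                x = x + 1
        return x
    return program

def toy_amend(steps: Steps, feedback: List[str]) -> Steps:
    return steps + ["add one"]

result = rhda(toy_propose, toy_translate, toy_amend, [(1, 3), (2, 5)])
```

Here the first hypothesis fails on both examples, the amendment appends a second sub-hypothesis, and the loop stops once execution matches every observation. Keeping the hypothesis as a list of steps is what lets an amendment target one sub-hypothesis instead of regenerating the whole solution.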

Paper Structure

This paper contains 50 sections, 4 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: Code reasoning is a category of tasks that incorporates logical reasoning into code, aiming to solve programming problems through logical reasoning. These tasks require a balance between background knowledge and thinking span, placing greater emphasis on the collaborative functioning of both System 1 and System 2 thinking.
  • Figure 2: An overview of the pipeline for solving code reasoning tasks. We decompose the hypothesis and generate executable functions step by step. After comparing the results with the seen observations and receiving feedback, we propose amendments, reflect on potential errors at each step, and generate revised hypotheses. This process is repeated until a valid problem-solving pathway is discovered. For brevity, we show partial code snippets.
  • Figure 3: The RHDA method on the abductive code reasoning task. $T$ is the maximum number of iterations and $N$ is the number of candidates.
  • Figure 4: We demonstrate how RHDA can be extended to the VirtualHome framework to successfully complete the task of storing the pie in the fridge.
  • Figure 5: The DSL syntax for string manipulation tasks in the RobustFill domain.
  • ...and 4 more figures