Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

Saurabh Pujar, Ira Ceka, Irene Manotas, Gail Kaiser, Baishakhi Ray, Shyam Ramji

TL;DR

This paper addresses how code-specific reasoning techniques at inference time influence Software Engineering (SWE) tasks. It proposes a taxonomy encompassing Code Chain-of-Thought (CoT), Self-refinement, Inference Scaling, and SWE agents, and surveys their application across code generation, test generation, issue resolution, and reasoning/understanding benchmarks. The authors contribute the first dedicated survey on code-based reasoning in SWE, curate a broad set of benchmarks, and provide a cross-technique comparative analysis to identify effective design patterns. The findings suggest structure-aware and modular CoT approaches, especially when combined with inference scaling and agentic architectures, offer substantial gains and guide future research toward more realistic SWE workflows and benchmarks. This work lays a foundation for building more capable, reliable AI tools that assist with real-world software engineering tasks.

Abstract

The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. Their performance on certain tasks can be further enhanced by incorporating test-time reasoning techniques. These inference-time advances have been adopted into the code domain, enabling complex software engineering (SWE) tasks such as code generation, test generation and issue resolution. However, the impact of different reasoning techniques on code-centric SWE tasks has not been systematically explored. In this work, we survey code reasoning techniques that underpin these capabilities, with a focus on test-time compute and inference-time reasoning paradigms. We examine a variety of code-specific reasoning methods and progressively build up to SWE agents, which combine planning, tool use, and multi-step interaction. We also compare the impact of different techniques on coding tasks, highlighting their relative importance and outlining open challenges and future research directions. Our contributions are: (1) to the best of our knowledge, the first dedicated survey of code reasoning for SWE tasks, highlighting overarching reasoning strategies, hybrid methods, and agentic approaches; (2) a taxonomy of inference-time techniques used to drive code reasoning, accompanied by a curated set of under-explored benchmarks with high potential for SWE evaluation; (3) a comparative analysis of reasoning design patterns across commonly used models and benchmarks; and (4) a synthesis of gaps in current methods and evaluation practices, identifying under-explored areas and concrete opportunities for future research.

Paper Structure

This paper contains 30 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: A simplified view of LLM inference for code tasks, illustrating both standard decoding and test-time compute-based reasoning techniques. The different colored regions indicate the core components of the reasoning methods covered in this survey. In standard inference, the user sends a query to an LLM, which produces a single greedy output (typically with temperature $t = 0$); for coding tasks, this output is executed and the resulting behavior is evaluated. CoT (Sec. \ref{sec:cot}) is induced by augmenting the user query with a reasoning-oriented prompt template or an auxiliary LLM. Inference Scaling (Sec. \ref{sec:inferencescaling}) generates multiple outputs by sampling with temperature $t > 0$, and uses search over these candidates before evaluation. Self-refine (Sec. \ref{sec:execution}) iteratively improves candidate outputs by having an LLM critique and revise them; the critic may be the same model (self-reflection) or a separate reflector model. Finally, an Agent (Sec. \ref{sec:agentssec}) orchestrates LLM calls and manages the execution environment, including tool use for interacting with code and external systems.
  • Figure 2: Code Reasoning Taxonomy. We organize prior work on code reasoning along two axes: techniques and tasks. Techniques are methods that elicit, enhance, or exploit reasoning in LLMs. Tasks (often instantiated as benchmarks) are used to evaluate an LLM’s code-reasoning capability. Techniques are frequently combined within a single system; in this taxonomy, we categorize each publication by its dominant technique (i.e., the primary mechanism emphasized by the method). Table \ref{tab:llm-hybrid} highlights works that explicitly integrate multiple techniques.
  • Figure 3: Structure-aware CoT vs. planning-based CoT on benchmarks MBPP-S, MBPP+, MBPP, APPS, HE, and HE+ with models GPT-3.5, Llama3-8B-Instruct (L3-8B-Instr), GPT-4o mini (4o-mini), and DeepSeek-R1 (DS-R1). Modular CoT is a sub-category of Structure-aware CoT. Overall, Modular CoT outperforms code structure-based CoT, which in turn outperforms plan-based CoT.
  • Figure 4: Comparison between code CoT and self-refinement techniques on code generation benchmarks HE+, HE, MBPP-ET, MBPP, APPS, and CodeContests with models Claude 3.5 Sonnet (Cl-3.5), GPT-3.5, and DeepSeek-Coder (DS). Code CoT is sub-categorized into plan-based and structure-aware CoT. Self-refinement techniques outperform code CoT-based techniques.
  • Figure 5: Performance comparison between Inference Scaling, Code CoT, and Self-refinement on code generation benchmarks MBPP+, MBPP, APPS, and LCB with models GPT-4o, GPT-4o mini (4o-mini), GPT-4, and OpenAI o1-mini (o1-mini). Inference Scaling outperforms code CoT-based approaches. Inference Scaling also outperforms self-refinement on MBPP+ with GPT-4o.
  • ...and 2 more figures
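
As an illustration of two of the techniques named in the Figure 1 caption, the following is a minimal Python sketch of Inference Scaling (sample several candidates at temperature $t > 0$, then select by executing them) and Self-refine (critique and regenerate until the tests pass). It is not the survey's implementation. All names here (`run_tests`, `best_of_n`, `self_refine`, `make_stub_generator`) are hypothetical, the "LLM" is a stub that replays canned outputs, and selection by number of passing tests is just one possible search strategy.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a generator maps (prompt, temperature) to candidate source code,
# and a critic maps candidate source code to textual feedback.
Generator = Callable[[str, float], str]
Critic = Callable[[str], str]
Tests = List[Tuple[tuple, object]]  # (positional args, expected result)


def run_tests(source: str, func_name: str, tests: Tests) -> int:
    """Execute a candidate and return how many tests it passes.

    Uses exec() without sandboxing, which is acceptable only for a toy demo.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails this test
    return passed


def best_of_n(generate: Generator, prompt: str, func_name: str,
              tests: Tests, n: int = 4, temperature: float = 0.8) -> str:
    """Inference scaling: sample n candidates, keep the one passing most tests."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=lambda src: run_tests(src, func_name, tests))


def self_refine(generate: Generator, critique: Critic, prompt: str,
                func_name: str, tests: Tests, max_rounds: int = 3) -> str:
    """Self-refine: regenerate with critic feedback until all tests pass."""
    source = generate(prompt, 0.0)
    for _ in range(max_rounds):
        if run_tests(source, func_name, tests) == len(tests):
            break
        feedback = critique(source)
        source = generate(prompt + "\n# Feedback: " + feedback, 0.0)
    return source


def make_stub_generator(outputs: List[str]) -> Generator:
    """Stand-in for an LLM: replays canned outputs in order."""
    it = iter(outputs)
    return lambda prompt, temperature: next(it)


# Toy task: implement add(a, b).
TESTS: Tests = [((1, 2), 3), ((0, 0), 0)]
BAD = "def add(a, b):\n    return a - b\n"
GOOD = "def add(a, b):\n    return a + b\n"

picked = best_of_n(make_stub_generator([BAD, GOOD, BAD]),
                   "write add", "add", TESTS, n=3)
refined = self_refine(make_stub_generator([BAD, GOOD]),
                      lambda src: "use + instead of -",
                      "write add", "add", TESTS)
```

In this sketch both `picked` and `refined` end up equal to `GOOD`. The difference lies in how the compute is spent: `best_of_n` draws independent samples and searches over them, while `self_refine` spends each additional call on revising the previous attempt.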