What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

TL;DR

This work systematically analyzes how reasoning, instruction, and execution-feedback prompts shape multi-turn code generation across large language models and competitive-programming benchmarks. Through a large-scale grid search, and by introducing CoT-retry and Rejection Sampling Fine-tuning (RFT), the authors identify where prompting strategies improve performance and how models can internalize reasoning through finetuning. The findings show that single-turn CoT helps on hard problems and with large models, and that multi-turn generation benefits from adaptive reasoning prompts and simple execution feedback. Crucially, RFT enables CoT-level reasoning without explicit prompts during inference. The study advances understanding of how to balance reasoning and code generation under compute constraints and points toward deployable, environment-aware code-generation agents.

Abstract

Prompting techniques such as chain-of-thought have established themselves as a popular vehicle for improving the outputs of large language models (LLMs). For code generation, however, their exact mechanics and efficacy are under-explored. We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks CodeContests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the induced reasoning process and obtain improvements in performance and scalability for multi-turn code generation.
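
The "automatic re-prompting over multiple turns" mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_public_tests`, `multi_turn_generate`, and `stub_model` are hypothetical, the model is a hard-coded stub rather than an LLM, and the candidate program is assumed to define a function `solve`, whereas real competitive-programming solutions typically read stdin and must be run in a sandbox.

```python
from typing import Callable, List, Optional, Tuple

# A public test is (positional arguments, expected return value).
PublicTest = Tuple[tuple, object]


def run_public_tests(code: str, tests: List[PublicTest]) -> Optional[str]:
    """Run candidate code on public tests.

    Returns None if every test passes, otherwise a short feedback string.
    Assumption: the candidate defines `solve`. exec() is unsafe for
    untrusted code; a real system would use an isolated sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        solve = namespace["solve"]
        for args, expected in tests:
            got = solve(*args)
            if got != expected:
                return f"Wrong answer on input {args}: expected {expected!r}, got {got!r}"
    except Exception as exc:
        return f"Runtime error: {type(exc).__name__}: {exc}"
    return None


def multi_turn_generate(
    model: Callable[[List[str]], str],
    problem: str,
    tests: List[PublicTest],
    max_turns: int = 3,
) -> Tuple[Optional[str], int]:
    """Generate code, execute it, and re-prompt with feedback on failure.

    Returns (passing code or None, number of turns used).
    """
    dialog = [problem]
    for turn in range(1, max_turns + 1):
        code = model(dialog)
        feedback = run_public_tests(code, tests)
        if feedback is None:
            return code, turn
        # Append the failed attempt and its execution feedback to the dialog.
        dialog += [code, feedback]
    return None, max_turns


def stub_model(dialog: List[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong on turn 1, fixed after feedback."""
    if len(dialog) == 1:
        return "def solve(a, b):\n    return a - b\n"
    return "def solve(a, b):\n    return a + b\n"


solution, turns_used = multi_turn_generate(
    stub_model, "Return the sum of two integers.", [((1, 2), 3), ((5, 5), 10)]
)
```

Here the feedback string plays the role of an execution feedback prompt; its granularity (binary pass or fail versus the failing input and output) is one of the axes the study varies.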

Paper Structure

This paper contains 54 sections, 2 equations, 22 figures, 13 tables.

Figures (22)

  • Figure 1: Our framework for evaluating LLM multi-turn code generation techniques. Top: In the default multi-turn setting, given a programming problem, the model generates a code solution, interacts with the runtime environment to gather execution feedback and retries in case of failure. Bottom: On top of the default setting, we gather reasoning (Reason.) prompts, instruction (Inst.) prompts, and execution feedback prompts. The problem statement is augmented with a reasoning prompt. After generating an answer to the reasoning prompt, an instruction prompt determines how program code should be generated. The execution feedback prompts vary in granularity, ranging from a binary pass or fail indicator to detailed tracing information.
  • Figure 2: Scaling the number of turns is not compute optimal. Pass@$10$ (Top) and pass $10$@$100$ (Bottom) on the CodeContests test set when increasing the number of turns with Llama 3.1 70B.
  • Figure 3: Prompting space explored in our survey. We explore chain of thought prompts at three different levels: before the first code generation (reasoning prompts), with code generation (instruction prompts), and after the first code generation (execution feedback). The corresponding works from the single-turn and multi-turn reasoning and code generation literature are: [1] CoA, [2] self_discover, [3] khot2022decomposed, [4] parsel, [5] jain2023llm, [6] Ldb, [7] ni2024next, [8] self-debug, [9] le2023codechain, [10] madaan2024self, [11] paul2023refiner, [12] tang2024coderepairllmsgives, [13] li2023explaining.
  • Figure 4: CoT helps most on hard examples. From a set of 8 reasoning and 6 instruction prompts commonly used on competitive coding benchmarks, we extract the pass rate of the best and worst prompts amongst all $63 = (8+1) \times (6+1)$ combinations (including no reasoning or no instruction) for Llama 3.0 8B. We compare across the difficulty splits of the TACO dataset. The relative gain from a tailored CoT increases with problem difficulty and sampling size.
  • Figure 5: Solution-based instruction prompts work best across Llama 3.1 models. We separate instruction prompts into "describe" (e.g., add comments, imports), "modularity" (e.g., add helper functions) and "solution" (e.g., write a naive solution, propose a clever algorithm). The performance difference ($\Delta$) is normalized with respect to the baseline and standard deviation per pass rate.
  • ...and 17 more figures
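
Two quantities in the captions above lend themselves to a short worked example: the pass@$k$ metric (Figure 2) and the 63-way prompt grid (Figure 4). The sketch below assumes the standard unbiased pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct; the captions do not state which estimator the paper uses. The prompt names are hypothetical placeholders, not the prompts from the paper. The pass $n$@$k$ metric additionally involves selecting $n$ submissions out of $k$ samples and is not sketched here.

```python
from itertools import product
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical placeholder names; None stands for "no prompt of this type".
reasoning_prompts = [None] + [f"reasoning_{i}" for i in range(1, 9)]      # 8 + 1
instruction_prompts = [None] + [f"instruction_{i}" for i in range(1, 7)]  # 6 + 1

# Every (reasoning, instruction) pair, including the no-prompt baseline.
prompt_grid = list(product(reasoning_prompts, instruction_prompts))
```

With these definitions, `len(prompt_grid)` equals $(8+1) \times (6+1) = 63$, matching the count in the Figure 4 caption, and `pass_at_k(2, 1, 1)` evaluates to 0.5.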