ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

TL;DR

ConvCodeWorld introduces a reproducible, multi-turn benchmark for interactive code generation by systematically combining compilation, execution (partial/full) and verbal feedback, enabling robust evaluation across nine realistic feedback scenarios. To address cost and reproducibility, ConvCodeBench provides a static, log-based proxy with strong rank correlations to the live benchmark, reducing reliance on costly LLM calls. Across 21+ models, the study shows that feedback type and test coverage significantly shape model performance, with weaker models sometimes outperforming state-of-the-art single-turn baselines when given rich feedback, and generalization proving challenging for unseen feedback combinations. The work offers insights into the trade-offs between Mean Reciprocal Rank and Recall, highlights the strong role of expert feedback in narrowing gaps, and provides publicly available benchmarks to accelerate research in reproducible, feedback-driven code generation.
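The trade-off between Mean Reciprocal Rank and Recall mentioned above can be made concrete with a small sketch. It assumes the standard definitions: MRR averages 1/turn over all problems, where turn is the first turn at which a problem is solved, and Recall is the fraction of problems solved at any turn. The 1-based turn indexing, the function names, and the two toy models are illustrative assumptions, not values or code from the paper.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_pass_turns: Sequence[Optional[int]]) -> float:
    """Average of 1/turn over all problems.

    Each entry is the 1-based turn at which the problem was first solved,
    or None if it was never solved (contributing 0 to the sum).
    """
    if not first_pass_turns:
        raise ValueError("need at least one problem")
    total = sum(1.0 / t if t is not None else 0.0 for t in first_pass_turns)
    return total / len(first_pass_turns)


def recall(first_pass_turns: Sequence[Optional[int]]) -> float:
    """Fraction of problems solved at any turn."""
    if not first_pass_turns:
        raise ValueError("need at least one problem")
    solved = sum(t is not None for t in first_pass_turns)
    return solved / len(first_pass_turns)


# Hypothetical toy data (not from the paper): four problems per model.
fast_model = [1, 1, None, None]  # solves few problems, but immediately
slow_model = [2, 4, 4, None]     # solves more problems, but needs more turns

print(mean_reciprocal_rank(fast_model), recall(fast_model))
print(mean_reciprocal_rank(slow_model), recall(slow_model))
```

In this toy case the first model has the higher MRR and the second has the higher Recall, which is the kind of disagreement between the two metrics that the summary describes.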

Abstract

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a fast, static version of the benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman's rank correlations (0.82 to 0.99) with CONVCODEWORLD. Third, extensive evaluations of both closed-source and open-source LLMs, including R1-Distill, on CONVCODEWORLD reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) weaker LLMs, with sufficient feedback, can outperform single-turn results of state-of-the-art LLMs without feedback; (c) training on a specific feedback combination can limit an LLM's ability to utilize unseen combinations; (d) LLMs that solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa. All implementations and benchmarks will be made publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld
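The Spearman's rank correlation used to compare the static and live benchmarks measures whether two benchmarks order the same models the same way. A minimal, dependency-free sketch follows, computing it as the Pearson correlation of average ranks. The helper names and the five per-model scores are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores (made up for illustration, one entry per model).
static_scores = [0.30, 0.42, 0.55, 0.61, 0.70]  # e.g. on the static benchmark
live_scores = [0.28, 0.45, 0.50, 0.66, 0.64]    # e.g. on the live benchmark

rho = spearman(static_scores, live_scores)
print(round(rho, 2))
```

Here the two score lists disagree only on the order of the top two models, so the correlation stays high. A value near 1 means the static benchmark ranks models almost exactly as the live one does.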

Paper Structure

This paper contains 52 sections, 3 equations, 21 figures, 17 tables.

Figures (21)

  • Figure 1: (Left) ConvCodeWorld is a dynamic, reproducible environment that simulates nine distinct feedback scenarios by combining three types of feedback. (Right) ConvCodeBench is a static version of the benchmark that uses pre-generated logs and strongly correlates with ConvCodeWorld. Together, these frameworks provide a comprehensive, cost-effective approach for evaluating LLMs in multi-turn, feedback-driven code generation, enabling scalable and consistent benchmarking across diverse feedback combinations.
  • Figure 2: Correlation between MRR on ConvCodeBench (ref. CodeLlama-7B-Instruct) and MRR on ConvCodeWorld with different feedback combinations $\Omega$.
  • Figure 3: Iterative Pass@$1$ results on ConvCodeWorld with different feedback combinations $\Omega$.
  • Figure 4: Iterative Pass@$1$ results of each LLM on ConvCodeWorld with different feedback combinations $\Omega$ (continued in Figure 5).
  • Figure 5: Iterative Pass@$1$ results of each LLM on ConvCodeWorld with different feedback combinations $\Omega$ (continued from Figure 4).
  • ...and 16 more figures