When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

TL;DR

This work argues that static benchmarks inadequately capture how code LLMs collaborate with users in real programming tasks. It presents an interactive evaluation pipeline that converts underspecified static questions into collaborative problems via input obfuscation and a simulated user, testing four feedback types to study model behavior across three datasets and ten models. Findings show that interactivity can reorder model rankings, with Code Feedback and Paragraph feedback delivering the largest performance gains, while feedback quality influences steerability and behavior in nuanced ways. The authors provide insights into how models incorporate feedback and propose an open-source pipeline to bridge the gap between traditional benchmarks and real-world usage in coding assistants.

Abstract

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting. Specifically, we perturb static coding benchmarks so that the code model must interact with a simulated user to retrieve key information about the problem. We find that interaction significantly affects model performance, as the relative rankings of 10 models across 3 datasets often vary between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that even when different feedback types are equally effective with respect to performance, they can impact model behaviors such as (1) how models respond to higher- vs. lower-quality feedback and (2) whether models prioritize aesthetic vs. functional edits. Our work aims to "re-evaluate" model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.
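
The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `interactive_eval`, its parameters, the stopping rule (stop as soon as the tests pass), and the default of three rounds are all assumptions made for the example, and the toy stand-ins at the bottom replace a real code model, simulated user, and test suite.

```python
from typing import Callable, List, Tuple

# One turn of history: (previous solution, feedback on that solution).
Turn = Tuple[str, str]


def interactive_eval(
    obfuscated_prompt: str,
    generate: Callable[[str, List[Turn]], str],
    give_feedback: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    max_rounds: int = 3,  # hypothetical default, not taken from the paper
) -> Tuple[str, int, bool]:
    """Run a generate -> feedback -> regenerate loop on an underspecified prompt.

    Returns (final solution, number of feedback rounds used, whether it passed).
    """
    history: List[Turn] = []
    solution = generate(obfuscated_prompt, history)
    for rounds_used in range(max_rounds):
        if passes_tests(solution):
            return solution, rounds_used, True
        # The simulated user sees the attempt and responds with feedback.
        feedback = give_feedback(solution)
        history.append((solution, feedback))
        solution = generate(obfuscated_prompt, history)
    return solution, max_rounds, passes_tests(solution)


# Toy stand-ins (hypothetical): the "model" improves by one version per round.
def toy_generate(prompt: str, history: List[Turn]) -> str:
    return f"v{len(history)}"


def toy_feedback(solution: str) -> str:
    return f"{solution} misses a requirement hidden by the obfuscation"


def toy_passes(solution: str) -> bool:
    return solution == "v2"


result = interactive_eval("sort the list", toy_generate, toy_feedback, toy_passes)
print(result)
```

In this sketch the feedback type (Code Feedback, Query Rephrasing, Paragraph, or Sentence) would be selected by swapping in a different `give_feedback` callable, which keeps the loop itself unchanged across settings.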

Paper Structure

This paper contains 46 sections, 7 equations, 18 figures, 14 tables.

Figures (18)

  • Figure 1: While most existing benchmarks statically evaluate LLM coding capabilities, code LLMs are used interactively in practice. We introduce an evaluation pipeline that evaluates code models in an interactive setting (top). Across three datasets, such as LiveCodeBench (bottom), we find that interactively evaluating models with different feedback types (Code Feedback, Query Rephrasing, Paragraph, and Sentence) leads to different rankings when compared to static evaluation.
  • Figure 2: Overview of our interactive pipeline for coding evaluation. (A) We obfuscate the input of existing fully specified datasets to reflect how programmers tend to underspecify requests to LLMs (e.g., via docstrings or comments) in practice. (B) As developers may interact with models in a variety of ways, we explore $4$ different feedback types and introduce a pipeline that mimics the iterative refinement loop that programmers often use with chat models, (C) where the code model generates a solution using feedback on its previous solution, (D) and the user provides updated feedback to the code model.
  • Figure 3: Rank changes between static and interactive settings across $3$ datasets: APPS, LiveCodeBench, and ClassEval. We stratify interactive settings by feedback type (Code Feedback, Query Rephrasing, Paragraph, and Sentence), and observe changes in rankings across all datasets and interactive settings.
  • Figure 4: Distribution of performance change across feedback types and directional correctness. We split solutions into post-feedback performance gains (green) or losses (red) and observe that models can still benefit from directionally incorrect feedback, and that directionally correct Code Feedback sometimes increases the rate of post-feedback performance loss.
  • Figure 5: Behavioral-level (top) and surface-level (bottom) steerability by feedback type, averaged across all models for APPS and LiveCodeBench. Paragraph feedback induces the most changes at both levels, while Code Feedback leads to more behavioral changes with fewer aesthetic changes.
  • ...and 13 more figures
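
The rank changes summarized in Figure 3 can be illustrated with a small calculation. This is only a sketch under stated assumptions: the helper names `rank_models` and `rank_changes`, the tie-breaking rule (alphabetical by model name), and the scores themselves are invented for the example and are not results from the paper.

```python
from typing import Dict


def rank_models(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its rank (1 = best); ties are broken by name."""
    ordered = sorted(scores, key=lambda model: (-scores[model], model))
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_changes(
    static_scores: Dict[str, float], interactive_scores: Dict[str, float]
) -> Dict[str, int]:
    """Positive values mean the model moved up in the interactive setting."""
    static_rank = rank_models(static_scores)
    interactive_rank = rank_models(interactive_scores)
    return {m: static_rank[m] - interactive_rank[m] for m in static_scores}


# Made-up scores for three hypothetical models (not data from the paper).
static_scores = {"model_a": 0.60, "model_b": 0.55, "model_c": 0.40}
interactive_scores = {"model_a": 0.62, "model_b": 0.70, "model_c": 0.45}

changes = rank_changes(static_scores, interactive_scores)
print(changes)  # {'model_a': -1, 'model_b': 1, 'model_c': 0}
```

Every model improves in absolute terms here, yet `model_a` and `model_b` swap places, which is the kind of reordering a static-only leaderboard would not reveal.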