When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback
Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen
TL;DR
This work argues that static benchmarks inadequately capture how code LLMs collaborate with users in real programming tasks. It presents an interactive evaluation pipeline that obfuscates the inputs of static benchmark questions so that they become underspecified, then pairs the model with a simulated user who holds the missing information. The pipeline tests four feedback types across three datasets and ten models. Findings show that interactivity can reorder model rankings, with code feedback and paragraph feedback delivering the largest performance gains, while feedback quality influences steerability and behavior in nuanced ways. The authors provide insights into how models incorporate feedback and offer an open-source pipeline to bridge the gap between traditional benchmarks and real-world usage in coding assistants.
Abstract
Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting. Specifically, we perturb static coding benchmarks so that the code model must interact with a simulated user to retrieve key information about the problem. We find that interaction significantly affects model performance, as the relative rankings of 10 models across 3 datasets often vary between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that even when different feedback types are equally effective with respect to performance, they can impact model behaviors such as (1) how models respond to higher- vs. lower-quality feedback and (2) whether models prioritize aesthetic vs. functional edits. Our work aims to "re-evaluate" model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.
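To make the setup concrete, the following is a minimal sketch of one interactive evaluation episode, not the authors' implementation. It assumes a single feedback round, a problem whose key detail has been removed by obfuscation, and a simulated user who can see the full specification. All names here (`Problem`, `evaluate_interactively`, `toy_model`, `toy_user`) are hypothetical, and the model and user are hard-coded stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A candidate solution is modeled as a plain Python function (assumption).
Candidate = Callable[[List[int]], List[int]]


@dataclass
class Problem:
    full_spec: str                      # original, fully specified question
    obfuscated_spec: str                # perturbed question shown to the model
    check: Callable[[Candidate], bool]  # hidden correctness check


def evaluate_interactively(
    problem: Problem,
    model: Callable[[str, List[str]], Candidate],
    user: Callable[[Problem, Candidate], str],
    rounds: int = 1,
) -> Tuple[bool, bool]:
    """Return (passed_before_feedback, passed_after_feedback)."""
    history: List[str] = []
    candidate = model(problem.obfuscated_spec, history)
    before = problem.check(candidate)
    for _ in range(rounds):
        # The simulated user sees the full spec and the current attempt.
        history.append(user(problem, candidate))
        candidate = model(problem.obfuscated_spec, history)
    after = problem.check(candidate)
    return before, after


def toy_model(spec: str, history: List[str]) -> Candidate:
    # Stand-in for a code LLM: it only learns the sort order from feedback.
    if any("descending" in feedback for feedback in history):
        return lambda xs: sorted(xs, reverse=True)
    return lambda xs: sorted(xs)


def toy_user(problem: Problem, candidate: Candidate) -> str:
    # Stand-in for the simulated user: reveals the obfuscated detail.
    return "The output should be in descending order."


problem = Problem(
    full_spec="Sort a list of integers in descending order.",
    obfuscated_spec="Sort a list of integers.",
    check=lambda f: f([1, 3, 2]) == [3, 2, 1],
)

before, after = evaluate_interactively(problem, toy_model, toy_user)
print(before, after)
```

In this toy episode the first attempt fails because the sort order was obfuscated, and the revised attempt passes once the user's feedback restores it. Comparing the two outcomes across many problems and models is one way to observe how rankings shift between static and interactive settings.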
