Copilot Arena: A Platform for Code LLM Evaluation in the Wild

Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

TL;DR

Copilot Arena introduces a live, IDE-integrated platform for collecting human preferences on code completions across 10 models in real developer workflows. It combines a novel head-to-head UI, latency-aware model sampling, and FiM-oriented prompting to produce a realistic, low-latency evaluation regime, and then builds a Bradley-Terry leaderboard from user judgments. The authors demonstrate that rankings derived from this in-the-wild setting differ from static benchmarks and chat-based evaluations, emphasizing the impact of task distribution and code context on model performance. By open-sourcing the platform and releasing a curated dataset, the work enables deeper, human-centered understanding of coding assistants and informs future evaluation methodologies in real-world software development environments.
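As a rough illustration of how a Bradley-Terry leaderboard can be fit from pairwise judgments, the sketch below uses the standard minorization-maximization (MM) update in pure Python. It is not the authors' implementation: the function name `fit_bradley_terry`, the `(winner, loser)` input format, and the model names are assumptions made for this example, and it assumes every model has at least one win and one loss and that the two models in a judgment are distinct.

```python
from collections import defaultdict


def fit_bradley_terry(judgments, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration. Returns a dict mapping each
    model to a strength; strengths are normalized to sum to 1.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in judgments:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strengths = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # MM update: wins divided by the sum over opponents of
            # n_ij / (p_i + p_j), using the current strengths.
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strengths[m] + strengths[other])
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        strengths = {m: s / total for m, s in updated.items()}
    return strengths


# Made-up judgments: model_a is preferred over model_b in 3 of 4 votes.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
scores = fit_bradley_terry(votes)
leaderboard = sorted(scores, key=scores.get, reverse=True)
```

Under the model, the probability that one model is preferred over another is its strength divided by the sum of the two strengths, so sorting by strength yields the leaderboard order.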

Abstract

Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the more realistic distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.

Paper Structure

This paper contains 29 sections, 6 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Copilot Arena is a platform for conducting realistic evaluations of code LLMs, collecting human preferences of coding models with real users, real tasks, and in realistic environments, aimed at addressing the limitations of existing evaluations.
  • Figure 2: We introduce Copilot Arena, a VSCode extension to collect human preferences on code directly in a developer's IDE. Copilot Arena enables developers to use code completions from various models. The system comprises a) the interface in the user's IDE, which presents paired completions to users (left), b) a sampling strategy that picks model pairs to reduce latency (right, top), and c) a prompting scheme that allows diverse LLMs to perform code completions with high fidelity. Users can select the top completion (green box) using tab or the bottom completion (blue box) using shift+tab.
  • Figure 3: The likelihood of users accepting one of the two completions as a function of empirical pairwise latency (determined by the slower of the two models). As latency increases, users are less likely to accept a completion. We devise a sampling strategy, described in the paper's section on sampling, which reduces pairwise latency by 33% while also ensuring sufficient coverage of unique model pairs.
  • Figure 4: We evaluate the effectiveness of our prompting scheme by comparing LLM performance on infilling tasks (using pass@1) before and after applying it. We evaluate 9 models of varying performance across 4 prompt templates (i.e., ways of encoding the prefix and suffix in the prompt): each point represents one pair of model and prompt template. We observe that the overwhelming majority of pairs benefit from our prompting scheme (i.e., lie above the diagonal line).
  • Figure 5: We compare model rankings in Copilot Arena (1st column) to existing evaluations, both static benchmarks (2nd-4th columns) and live preference evaluations (last two columns). For existing evaluations, we show the change in rank relative to the Copilot Arena rank, with positive values in green denoting models performing better on existing evaluations, negative values in red denoting models performing worse, and a dash indicating that the model is not present in the live leaderboard. We also report the Spearman rank correlation coefficients between Copilot Arena and other leaderboards.
  • ...and 11 more figures
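
The Spearman rank correlation mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `spearman_rho`, the dict-based input format, and the model names and ranks are all made up for the example. It assumes no tied ranks, restricts the comparison to models present on both leaderboards, and re-ranks them within that shared subset before applying the closed-form formula.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation between two leaderboards.

    Each argument maps a model name to its rank (1 = best). Only models
    present in both leaderboards are compared. Assumes no tied ranks.
    """
    shared = sorted(set(ranks_a) & set(ranks_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")

    def rerank(ranks):
        # Re-rank the shared models as 1..n, preserving their order.
        ordered = sorted(shared, key=lambda m: ranks[m])
        return {m: i + 1 for i, m in enumerate(ordered)}

    ra, rb = rerank(ranks_a), rerank(ranks_b)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in shared)
    # Closed form for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical ranks: the top two models swap places on the benchmark,
# and m5 appears only on the benchmark, so it is ignored.
arena = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
benchmark = {"m2": 1, "m1": 2, "m3": 3, "m4": 4, "m5": 5}
rho = spearman_rho(arena, benchmark)
```

A value near 1 means the two leaderboards order the shared models almost identically, while a value near 0 means the orderings are largely unrelated.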