The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Hussein Mozannar; Valerie Chen; Mohammed Alsobay; Subhro Das; Sebastian Zhao; Dennis Wei; Manish Nagireddy; Prasanna Sattigeri; Ameet Talwalkar; David Sontag

TL;DR

Although static benchmarks do not include humans in the loop, improvements in benchmark performance lead to increased programmer productivity; however, gaps in benchmark performance are not proportional to gaps in human performance, a trend that holds across both autocomplete and chat support.

Abstract

Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or, more recently, on human preferences over LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers through either autocomplete or chat support. We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Although static benchmarks do not include humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity; however, gaps in benchmark performance are not proportional to gaps in human performance, a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better proxy signals. We open-source RealHumanEval to enable human-centric evaluation of new models, and we release the study data to facilitate efforts to improve code models.

Paper Structure (31 sections, 1 equation, 16 figures, 1 table)

User study details
RealHumanEval interface screenshots
User Study Instructions
Autocomplete Condition
Chat Condition
No LLM Condition
Post-Study Questionnaire
Data release considerations
Task Design
Task categories
Algorithmic coding problems:
Editing and augmenting existing code:
Data science tasks:
Task organization
LLM Details
...and 16 more sections

Figures (16)

Figure 1: Screenshot of the autocomplete LLM-assistance interface in our user study.
Figure 2: Another screenshot of the autocomplete LLM-assistance interface in our user study.
Figure 3: Screenshot of the chat LLM-assistance interface in our user study.
Figure 4: Pass@1 of LLMs and their chat variants on two canonical benchmarks, HumanEval and MBPP (results from Rozière et al., 2023; Liu et al., 2023), showing that CodeLlama-7b models perform worse than CodeLlama-34b models, which in turn perform worse than GPT-3.5 models.
Figure 5: Comparing the acceptance rate of suggestions that participants explicitly requested with that of suggestions the autocomplete system provided automatically.
...and 11 more figures
