HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent

Jie JW Wu; Fatemeh H Fard

TL;DR

This paper introduces HumanEvalComm, a benchmark that evaluates the communication competence of Code LLMs and LLM-based agents in code generation by modifying HumanEval problem descriptions to be ambiguous, inconsistent, or incomplete. It defines two new metrics, Communication Rate and Good Question Rate, and uses an LLM-based evaluator to assess question quality, alongside traditional code-generation metrics such as Pass@1 and Test Pass Rate. The authors propose Okanagan, a three-round LLM agent that generates code, asks clarifying questions, and then regenerates code using the conversation history. They show that Okanagan substantially improves communication metrics, and often code quality, on HumanEvalComm compared to Code LLMs, though it can over-ask on complete problems. They also examine the reliability of the LLM-based evaluator and the impact of prompt strategies and hyperparameters, and they discuss limitations, threats to validity, and future directions, including more robust evaluation methods and broader datasets. Overall, this work highlights communication as a critical capability for AI-assisted software engineering and provides a foundation for evaluating and advancing interactive code-generation systems.

Abstract

Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should apply to LLMs for code generation tasks. In this work, we conducted an empirical study that benchmarks and analyzes the communication skills of LLMs for code generation. We define communication skills of LLMs as "being able to ask clarifying questions when the description of the code generation problem has issues". We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to three issues: inconsistency, ambiguity, and incompleteness. We defined new evaluation metrics such as Communication Rate and Good Question Rate, and then experimented on HumanEvalComm with different Code LLMs and with a new LLM agent approach, Okanagan, which identifies ambiguous parts of the code and descriptions and asks questions about them to further refine the generated code. Finally, we discussed the evaluation results by comparing Code LLMs and Okanagan.
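The three-round agent flow mentioned in the TL;DR (generate code, ask clarifying questions, regenerate code with history) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name `okanagan_sketch`, the `LLM` callable type, the prompt wording, the `NONE` stop signal, and the `answer_questions` callback are all assumptions introduced here.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a model reply.
LLM = Callable[[str], str]


def okanagan_sketch(
    problem: str,
    llm: LLM,
    answer_questions: Callable[[str], str],
) -> str:
    """Illustrative three-round loop; prompts and stop signal are assumed."""
    history: List[str] = [f"Problem:\n{problem}"]

    # Round 1: produce a first draft of the code from the description alone.
    code = llm("\n\n".join(history) + "\n\nWrite Python code for this problem.")
    history.append(f"Draft code:\n{code}")

    # Round 2: reflect on the description and draft, and either ask
    # clarifying questions or signal that none are needed (assumed: "NONE").
    questions = llm(
        "\n\n".join(history)
        + "\n\nList clarifying questions about the problem, or reply NONE."
    )
    if questions.strip().upper() == "NONE":
        return code

    # Record the questions and whatever answers the asker receives.
    history.append(f"Questions:\n{questions}")
    history.append(f"Answers:\n{answer_questions(questions)}")

    # Round 3: regenerate the code with the full history in the prompt.
    return llm("\n\n".join(history) + "\n\nRewrite the code using the answers.")
```

In this sketch, returning the draft when no questions are raised is one possible way to limit over-asking on complete problems; how the actual agent handles that case is not specified in this summary.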

Paper Structure (30 sections, 2 equations, 6 figures, 16 tables)

Introduction
Benchmark Construction
Benchmark Collection
Evaluation Measurement
Empirical Study
Research Questions
Methodology Overview
Code Large Language Models
LLM-Agent Approach (Okanagan)
Experiment Setup
Results and Analysis
Communication Competency of Code LLMs on HumanEvalComm (RQ1)
Comparing Okanagan with Code LLMs in communication skills (RQ2)
Manual Evaluation of LLM-based Evaluator (RQ3)
Investigating Different Impacts of Prompt Strategies and Hyperparameters (RQ4)
...and 15 more sections

Figures (6)

Figure 1: The visual illustration of the methodology on HumanEvalComm benchmark (with statistics) and the evaluation of communication skills for Code LLMs and LLM Agent.
Figure 2: Flowchart for the evaluation of models, either Code LLMs or Okanagan (LLM agent), in communication capability.
Figure 3: An illustration of the process of Okanagan, an LLM agent approach.
Figure 4: Comparison of the effectiveness of the models in Communication Rate, Good Question Rate (left), and Pass@1, Test Pass Rate (right). Note that in the right figure, the stars represent the original performance of the corresponding model with the same color in the HumanEval benchmark. This shows visually how the performance has changed when the problem description is modified.
Figure 5: Comparison of Comm. Rate and Good Question Rate between Manual Evaluation and Automated Evaluation Across Models. Each row shows the resulting percentage of a model on a particular metric followed by the Kappa ($\kappa$) value in parentheses.
...and 1 more figure
