ClarifyGPT: Empowering LLM-based Code Generation with Intention Clarification

Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, Chenxue Wang, Shichao Liu, Qing Wang

TL;DR

ClarifyGPT addresses the ambiguity problem in LLM-based code generation by detecting ambiguous requirements through a code-consistency check and soliciting targeted clarifications via reasoning-based prompts. The approach comprises test input generation, code-consistency checks, and an interactive clarification loop that refines the prompt before final code generation. Experiments across MBPP-sanitized, MBPP-ET, HumanEval and its extended variants show substantial Pass@1 gains for GPT-4 and ChatGPT, with real-user and simulated feedback confirming robustness. A high-fidelity user simulation enables automated evaluation at scale, and a public dataset and code release support replication. The work suggests a practical path to improving real-world LLM-driven development.

Abstract

We introduce a novel framework named ClarifyGPT, which aims to enhance code generation by empowering LLMs with the ability to identify ambiguous requirements and ask targeted clarifying questions. In particular, ClarifyGPT first detects whether a given requirement is ambiguous by performing a code consistency check. If it is ambiguous, ClarifyGPT prompts an LLM to generate targeted clarifying questions. After receiving question responses, ClarifyGPT refines the ambiguous requirement and inputs it into the same LLM to generate a final code solution. To evaluate our ClarifyGPT, we first conduct a human evaluation involving ten participants who use ClarifyGPT for code generation on two publicly available benchmarks: MBPP-sanitized and MBPP-ET. The results show that ClarifyGPT elevates the performance (Pass@1) of GPT-4 from 70.96% to 80.80% on MBPP-sanitized. Furthermore, to perform large-scale automated evaluations of ClarifyGPT across different LLMs and benchmarks without requiring user participation, we introduce a high-fidelity simulation method to simulate user responses. The automated evaluation results also demonstrate that ClarifyGPT can significantly enhance code generation performance compared to the baselines. In particular, ClarifyGPT improves the average performance of GPT-4 and ChatGPT across four benchmarks from 68.02% to 75.75% and from 58.55% to 67.22%, respectively. We believe that ClarifyGPT can effectively facilitate the practical application of LLMs in real-world development environments.
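The code consistency check described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the sampled solutions are already available as plain Python callables (in practice they would be code strings sampled from the LLM) and that test inputs have already been generated. The names `run_safely` and `is_ambiguous` are hypothetical. A requirement is flagged as ambiguous when the candidates disagree on at least one input.

```python
from typing import Any, Callable, Sequence


def run_safely(solution: Callable[..., Any], args: tuple) -> str:
    """Run one candidate on one input and return a comparable summary of the result."""
    try:
        return repr(solution(*args))
    except Exception as exc:  # a crash is also observable behavior
        return f"<raised {type(exc).__name__}>"


def is_ambiguous(
    solutions: Sequence[Callable[..., Any]],
    test_inputs: Sequence[tuple],
) -> bool:
    """Flag a requirement as ambiguous when candidates disagree on any test input."""
    # Each candidate is summarized by its tuple of outputs over all inputs.
    signatures = {
        tuple(run_safely(solution, args) for args in test_inputs)
        for solution in solutions
    }
    return len(signatures) > 1


# Illustrative requirement: "Write a function to sort a list" (order unspecified).
def candidate_ascending(xs):
    return sorted(xs)


def candidate_descending(xs):
    return sorted(xs, reverse=True)


def candidate_ascending_copy(xs):
    ys = list(xs)
    ys.sort()
    return ys


inputs = [([3, 1, 2],), ([],), ([5, 5, 1],)]
ambiguous = is_ambiguous([candidate_ascending, candidate_descending], inputs)
consistent = is_ambiguous([candidate_ascending, candidate_ascending_copy], inputs)
```

Here the ascending and descending candidates disagree, so the requirement would be sent to the clarification step. The two ascending candidates agree on every input, so it would go straight to code generation.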

Paper Structure (27 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The Overview of ClarifyGPT
  • Figure 2: List of basic type-aware mutations over input $x$ (DBLP:journals/corr/abs-2305-01210)
  • Figure 3: The details of the prompts used in ClarifyGPT
  • Figure 4: Two real cases from HumanEval and MBPP generated by two baselines and our ClarifyGPT.
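
Figure 2 refers to type-aware mutation of a test input $x$. The figure itself is not reproduced here, so the sketch below only illustrates the general idea: derive a new input from a seed input while keeping its type. The specific rules (plus or minus one for numbers, dropping or repeating a character for strings, mutating one element of a list) and the name `mutate` are assumptions for illustration, not the list from the figure.

```python
import random
from typing import Any


def mutate(value: Any, rng: random.Random) -> Any:
    """Return a mutated copy of `value` that keeps the same Python type."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return rng.choice([True, False])
    if isinstance(value, (int, float)):
        return value + rng.choice([-1, 1])
    if isinstance(value, str):
        if not value:
            return value
        i = rng.randrange(len(value))
        if rng.random() < 0.5:
            return value[:i] + value[i + 1:]  # drop one character
        return value[:i] + value[i] + value[i:]  # repeat one character
    if isinstance(value, list):
        if not value:
            return []
        i = rng.randrange(len(value))
        copy = list(value)
        copy[i] = mutate(copy[i], rng)  # recurse into one element
        return copy
    return value  # unsupported types are returned unchanged


rng = random.Random(0)  # seeded so runs are repeatable
seed_inputs = [3, 2.5, True, "abc", [1, "a"]]
mutated_inputs = [mutate(x, rng) for x in seed_inputs]
```

Each mutated value has the same type as its seed, so it remains a valid argument for the function under test.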