LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Sarah Fakhoury; Aaditya Naik; Georgios Sakkas; Saikat Chakraborty; Shuvendu K. Lahiri

TL;DR

The paper tackles the challenge of ambiguity in natural-language specifications when generating code with LLMs. It introduces TiCoder, an interactive workflow that uses automatically generated tests to clarify user intent and prune/rank code suggestions, yielding more correct code and reduced cognitive load. In a mixed-methods study, TiCoder substantially improves task correctness and reduces cognitive effort; in large-scale benchmarks across MBPP and HumanEval, TiCoder significantly boosts pass@1@m for multiple LLMs, with TiCoder-Output delivering the strongest gains and enabling smaller models to approach the performance of larger models. The work demonstrates the practical value of execution-based, test-driven disambiguation for AI-assisted programming and outlines future directions toward richer specifications and broader task domains.

Abstract

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow, TiCoder, for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow in improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
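To make the idea of clarifying intent through tests concrete, the following is a minimal sketch of how a single user verdict on one generated test could prune a set of code suggestions. It is an illustration under stated assumptions, not the authors' implementation: the function names (`passes`, `prune`), the toy candidates, the single-integer test format, and the rule that a rejected test removes the candidates that pass it are all hypothetical choices made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical types for this sketch: a candidate is a one-argument function,
# and a test is an (input, expected_output) pair.
Candidate = Callable[[int], int]
Test = Tuple[int, int]


def passes(candidate: Candidate, test: Test) -> bool:
    """Run one candidate on one test; a crash counts as a failure."""
    arg, expected = test
    try:
        return candidate(arg) == expected
    except Exception:
        return False


def prune(candidates: List[Candidate], test: Test, user_accepts: bool) -> List[Candidate]:
    """Keep only the candidates consistent with the user's verdict on the test.

    Assumed rule: if the user accepts the test, keep candidates that pass it;
    if the user rejects it, keep candidates that do not pass it.
    """
    return [c for c in candidates if passes(c, test) == user_accepts]


# Toy code suggestions for the hypothetical intent "return the square of n".
def double(n: int) -> int:
    return n * 2


def cube(n: int) -> int:
    return n ** 3


def square(n: int) -> int:
    return n * n


if __name__ == "__main__":
    suggestions = [double, cube, square]
    # The user confirms that input 3 should give 9.
    remaining = prune(suggestions, (3, 9), user_accepts=True)
    print([c.__name__ for c in remaining])
```

In this toy setting, one confirmed test is enough to isolate `square`, whereas a weaker test such as `(2, 4)` would leave both `double` and `square`, which hints at why the choice of which test to show the user matters.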

Paper Structure (39 sections, 1 equation, 4 figures, 5 tables)

Introduction
Related Work
Improving Code Generation Accuracy
Usability of AI Programming Assistants
Research Questions and Paper Organization
Proposed Approach: TiCoder
High-level Workflow
TiCoder Implementation
Generating Code and Tests
Ranking test suggestions
Pruning and ranking code suggestions
RQ1: User Study Methodology
Treatments
Control condition: AI Programming Assistant 1
Treatment condition: AI Programming Assistant 2
...and 24 more sections

Figures (4)

Figure 1: TiCoder workflow.
Figure 2: Example format, as well as code and test prompts for the running example.
Figure 3: Code and test suggestions for the running example in Figure 2, generated by an LLM. Code suggestion $c_3$ and test suggestion $t_3$ are both correct, while code suggestions $c_1$, $c_2$ and test suggestions $t_1$, $t_2$ are incorrect (they appear shaded), i.e., they do not satisfy the problem prompts in Figure 2.
Figure 4: From left to right: Examples of different interaction sequences invoked by Assistant 1 on task T1 (directly display all code suggestions), Assistant 2 on T2 (validate the test output on a given input), and Assistant 3 on T3 (specify the output for a given input).
