User Centric Evaluation of Code Generation Tools

Tanha Miah, Hong Zhu

TL;DR

This paper advances code-generation evaluation by shifting from purely performance-based benchmarks to user-centric usability assessment. It introduces a metadata-rich, scenario-based benchmark and a multi-attempt testing process that mirrors real user interactions, along with a suite of usability quality attributes and user-experience metrics. The authors validate the approach through a case study evaluating ChatGPT for R programming, revealing strong overall usability but weaknesses in conciseness and visualization tasks, and they discuss limitations and threats to validity. The framework enables more realistic, scenario-driven evaluations that can inform tool improvement and adoption decisions in software development contexts.

Abstract

With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as intelligent tools for generating program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. When deciding whether to use an LLM in software production, it is desirable to evaluate its usability. This paper proposes a user-centric method for this purpose. It includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics how LLMs are used, measures LLM-generated solutions on a set of quality attributes that reflect usability, and evaluates performance based on the user's experience of using the LLM as a tool. The paper also reports a case study that applies the method to evaluate ChatGPT's usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code, although it may fail on hard programming tasks. The user experience is good, with an overall average of 1.61 attempts and an average completion time of 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5.
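
The multi-attempt process and the two user-experience figures quoted in the abstract (average number of attempts and average completion time) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TestCase`, `Outcome`, `run_multi_attempt`, and `summarize`, the metadata keys, and the `max_attempts` limit are all assumptions made for illustration, and the `generate` and `accept` callables stand in for the LLM call and for the judgment of whether a solution is acceptable.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class TestCase:
    """A benchmark task plus usage metadata (field names are hypothetical)."""
    task: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Outcome:
    attempts: int    # attempts used, including the final one
    seconds: float   # total time spent across all attempts
    solved: bool     # whether an acceptable solution was obtained


def run_multi_attempt(
    case: TestCase,
    generate: Callable[[TestCase, int], Tuple[str, float]],
    accept: Callable[[TestCase, str], bool],
    max_attempts: int = 3,  # assumed limit, not taken from the paper
) -> Outcome:
    """Query the tool repeatedly until a solution is accepted or attempts run out."""
    total_seconds = 0.0
    for attempt in range(1, max_attempts + 1):
        solution, elapsed = generate(case, attempt)
        total_seconds += elapsed
        if accept(case, solution):
            return Outcome(attempt, total_seconds, True)
    return Outcome(max_attempts, total_seconds, False)


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Aggregate user-experience metrics over all test cases."""
    return {
        "avg_attempts": mean(o.attempts for o in outcomes),
        "avg_seconds": mean(o.seconds for o in outcomes),
        "success_rate": mean(1.0 if o.solved else 0.0 for o in outcomes),
    }
```

In use, `generate` would wrap a call to the tool under evaluation and return the generated code with the time taken, while `accept` would encode whatever acceptance check the evaluator applies. Keeping both as parameters lets the same loop run against stubs, which is how the sketch can be exercised without access to an LLM.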

Paper Structure (22 sections, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Structure of JSON Representation of Test Cases.
  • Figure 2: Example of test case represented in the JSON format.
  • Figure 3: Average Performance Scores on Quality Criteria.
  • Figure 4: Distribution of the Numbers of Attempts.
  • Figure 5: Distribution of Completion Times (Seconds).
  • ...and 8 more figures