Bridging HCI and AI Research for the Evaluation of Conversational SE Assistants

Jonan Richards, Mairieli Wessel

TL;DR

This paper tackles the challenge of evaluating LLM-based conversational SE assistants at scale while ensuring alignment with developers' needs. It argues for a combined approach that fuses simulated users (for realistic, qualitative data) with LLM-as-a-Judge (for scalable, quantitative assessment) to achieve automatic, human-centered evaluation. The authors outline four evaluation requirements—realistic multi-turn conversations, diversity, quantitative metrics, and qualitative insights—and propose a workflow that iterates between simulated interactions and judge-based scoring. They discuss challenges in persona realism, bias mitigation, and contextual grounding, and position the method as a complement to manual user studies rather than a replacement.

Abstract

As Large Language Models (LLMs) are increasingly adopted in software engineering, recently in the form of conversational assistants, ensuring these technologies align with developers' needs is essential. The limitations of traditional human-centered methods for evaluating LLM-based tools at scale raise the need for automatic evaluation. In this paper, we advocate combining insights from human-computer interaction (HCI) and artificial intelligence (AI) research to enable human-centered automatic evaluation of LLM-based conversational SE assistants. We identify requirements for such evaluation and challenges down the road, working towards a framework that ensures these assistants are designed and deployed in line with user needs.

Paper Structure

This paper contains 16 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Flaws in reference-based evaluation: incorrect response A scores high on BLEU-4, while correct response B scores low due to phrasing.
  • Figure 2: Reference-free datasets miss conversations where responses refer to earlier messages.
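
The flaw described in the Figure 1 caption can be reproduced with a small sketch. The code below is a simplified, unsmoothed BLEU-4 written for illustration only; it is not the paper's implementation, and the reference sentence and the two responses are hypothetical examples, not data from the paper. Response A contradicts the reference but shares most of its wording, while response B is a correct paraphrase.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Simplified single-reference BLEU without smoothing (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives score 0
        log_sum += math.log(overlap / total)
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum / max_n)


# Hypothetical example sentences (assumptions, not taken from the paper).
reference = "the function returns true if the list is empty"
response_a = "the function returns false if the list is empty"  # incorrect
response_b = "it yields true when given an empty list"          # correct paraphrase

score_a = simple_bleu(reference, response_a)
score_b = simple_bleu(reference, response_b)
print(f"A (incorrect): {score_a:.3f}")
print(f"B (correct):   {score_b:.3f}")
```

Under these assumptions, the incorrect response outscores the correct one because n-gram overlap rewards shared wording and ignores meaning.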