Bridging HCI and AI Research for the Evaluation of Conversational SE Assistants
Jonan Richards, Mairieli Wessel
TL;DR
This paper tackles the challenge of evaluating LLM-based conversational SE assistants at scale while keeping the evaluation aligned with developers' needs. It argues for combining simulated users (for realistic, qualitative data) with LLM-as-a-Judge (for scalable, quantitative assessment) to achieve automatic, human-centered evaluation. The authors outline four evaluation requirements (realistic multi-turn conversations, diversity, quantitative metrics, and qualitative insights) and propose a workflow that iterates between simulated interactions and judge-based scoring. They discuss challenges in persona realism, bias mitigation, and contextual grounding, and position the method as a complement to manual user studies rather than a replacement.
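To make the workflow in this summary concrete, the following is a minimal sketch of one pass of simulated interaction followed by judge-based scoring. It is not the authors' implementation. Every name in it (`Persona`, `simulate_conversation`, `evaluate`, and the three `stub_*` functions) is hypothetical, and the stubs are deterministic stand-ins for what would in practice be LLM calls for the simulated user, the assistant under test, and the judge.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Persona:
    """Hypothetical simulated developer; varying personas gives diversity."""
    name: str
    goal: str
    max_turns: int = 3


@dataclass
class Conversation:
    persona: Persona
    # Each turn is a (speaker, text) pair, speaker is "user" or "assistant".
    turns: list[tuple[str, str]] = field(default_factory=list)


def simulate_conversation(persona, user_reply, assistant_reply):
    """Run a multi-turn exchange between a simulated user and the assistant."""
    conv = Conversation(persona)
    for _ in range(persona.max_turns):
        conv.turns.append(("user", user_reply(persona, conv.turns)))
        conv.turns.append(("assistant", assistant_reply(conv.turns)))
    return conv


def evaluate(personas, user_reply, assistant_reply, judge):
    """Score every simulated conversation and aggregate the results."""
    results = []
    for persona in personas:
        conv = simulate_conversation(persona, user_reply, assistant_reply)
        score, rationale = judge(conv)  # quantitative score, qualitative note
        results.append({
            "persona": persona.name,
            "score": score,
            "rationale": rationale,
            "turns": len(conv.turns),
        })
    return {"mean_score": mean(r["score"] for r in results), "results": results}


# Deterministic stand-ins (assumptions): real versions would call an LLM.
def stub_user(persona, turns):
    return f"[{persona.name}] turn {len(turns) // 2 + 1}: {persona.goal}"


def stub_assistant(turns):
    return "Suggested fix for: " + turns[-1][1]


def stub_judge(conv):
    # Toy rubric: fraction of assistant replies that mention the user's goal.
    replies = [text for speaker, text in conv.turns if speaker == "assistant"]
    hits = sum(conv.persona.goal in reply for reply in replies)
    return hits / len(replies), f"{hits}/{len(replies)} replies addressed the goal"


personas = [
    Persona("novice", "fix a failing unit test"),
    Persona("expert", "refactor a module", max_turns=2),
]
report = evaluate(personas, stub_user, stub_assistant, stub_judge)
```

In this sketch, each entry in `report["results"]` pairs a numeric score with a short rationale, mirroring the quantitative and qualitative requirements listed above. Iterating would mean revising the personas or the assistant based on those rationales and calling `evaluate` again.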
Abstract
As Large Language Models (LLMs) are increasingly adopted in software engineering, most recently in the form of conversational assistants, ensuring these technologies align with developers' needs is essential. The limitations of traditional human-centered methods for evaluating LLM-based tools at scale raise the need for automatic evaluation. In this paper, we advocate combining insights from human-computer interaction (HCI) and artificial intelligence (AI) research to enable human-centered automatic evaluation of LLM-based conversational SE assistants. We identify requirements for such evaluation and challenges down the road, working towards a framework that ensures these assistants are designed and deployed in line with user needs.
