Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration

Aline Mangold, Kiran Hoffmann

TL;DR

This work addresses the lack of human-centered evaluation for retrieval-augmented generation by proposing a questionnaire grounded in Gienapp's utility taxonomy and enabling human–AI collaboration. Through directed content analysis, iterative refinement, and an evaluation with two raters plus a GPT-4o judge, the authors distill a final 12-metric instrument rated on a 5-point scale. Key contributions include the development of a human-centered framework that separates human-only from human–LLM rated items, insights into alignment and formatting issues between humans and LLMs, and practical guidance for deploying such evaluations in real-world RAG systems. The framework offers a pragmatic pathway to improving user satisfaction, information verifiability, and output structuring in knowledge-grounded AI systems while highlighting ethical considerations and avenues for future validation and extension.

Abstract

Retrieval-augmented generation (RAG) systems are increasingly deployed in user-facing applications, yet systematic, human-centered evaluation of their outputs remains underexplored. Building on Gienapp's utility-dimension framework, we designed a human-centered questionnaire that assesses RAG outputs across 12 dimensions. We iteratively refined the questionnaire through several rounds of ratings on a set of query-output pairs and semantic discussions. Ultimately, we incorporated feedback from both a human rater and a human-LLM pair. Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations. Humans, by contrast, struggled to focus strictly on metric descriptions and labels. LLM ratings and explanations were viewed as helpful support, but numeric LLM and human ratings lacked agreement. The final questionnaire extends the initial framework by focusing on user intent, text structuring, and information verifiability.

Paper Structure

This paper contains 29 sections, 1 figure.

Figures (1)

  • Figure 1: Taxonomy of utility dimensions (Gienapp et al., 2024). Licensed under CCAI 4.0 License.