SPHERE: An Evaluation Card for Human-AI Systems

Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu

TL;DR

This work introduces SPHERE, a five-dimension evaluation card for human-AI systems powered by large language models, designed to standardize how evaluations are planned and documented across NLP and HCI. By analyzing 39 papers, the authors demonstrate how SPHERE can guide designers to consider what is evaluated, how the evaluation is conducted, who participates, when evaluations occur, and how they are validated. The analysis yields three recommendations for improving real-world relevance, validity, and meta-evaluation. The card functions as both a design tool and a documentation template that promotes transparency and reproducibility, as shown in two case studies illustrating its use for planning and reproducing evaluations. The study advocates broader adoption of SPHERE to enable more rigorous, actionable, and context-aware evaluations of human-AI systems in diverse domains.

Abstract

In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of human-AI system evaluation design options, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.

Paper Structure

This paper contains 64 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Example SPHERE card for the system AngleKindling (Petridis et al., 2023).
  • Figure 2: PRISMA diagram depicting the search strategy used to identify human-LLM systems for inclusion in our literature review.
  • Figure 3: Distribution of evaluation annotations on the 39 papers by HCI or NLP venues using SPHERE.
  • Figure 4: Example SPHERE evaluation card for LearnLM-Tutor (Jurenka et al., 2024). One can apply SPHERE with one card per human-AI system, as in the AngleKindling card, or with one card per evaluation method for cleaner separation.
  • Figure 5: Example SPHERE evaluation card for AngleKindling (Petridis et al., 2023).