SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu
TL;DR
This work introduces SPHERE, a five-dimension evaluation card for human-AI systems powered by large language models, designed to standardize how evaluations are planned and documented across NLP and HCI. Through an analysis of 39 papers, the authors show how SPHERE prompts designers to consider what is evaluated, how the evaluation is conducted, who participates, when the evaluation occurs, and how the evaluation is validated. The analysis yields three recommendations for improving real-world relevance, validity, and meta-evaluation. The card serves as both a design tool and a documentation template that promotes transparency and reproducibility; two case studies illustrate its use for planning and for reproducing evaluations. The study advocates broader adoption of SPHERE to enable more rigorous, actionable, and context-aware evaluations of human-AI systems across diverse domains.
Abstract
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of design options for human-AI system evaluation, we present SPHERE, an evaluation card that encompasses five key dimensions: (1) What is being evaluated? (2) How is the evaluation conducted? (3) Who is participating in the evaluation? (4) When is the evaluation conducted? (5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and we provide three recommendations for improving the validity and rigor of evaluation practices.
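As an illustration only, the five dimensions listed above could be recorded as a simple structured template. The following is a minimal sketch, assuming one free-text field per dimension; the class name `SphereCard`, its field names, the `missing` helper, and the example values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, asdict


@dataclass
class SphereCard:
    """Hypothetical record with one free-text field per SPHERE dimension."""
    what: str        # 1) What is being evaluated?
    how: str         # 2) How is the evaluation conducted?
    who: str         # 3) Who is participating in the evaluation?
    when: str        # 4) When is the evaluation conducted?
    validation: str  # 5) How is the evaluation validated?

    def missing(self) -> list:
        """Return the names of dimensions that were left blank."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Invented example values for an imaginary writing-assistant study.
card = SphereCard(
    what="Quality of essays co-written with the assistant",
    how="Controlled user study with rating scales",
    who="Participants recruited online",
    when="After a single writing session",
    validation="",  # deliberately left blank
)

print(card.missing())
```

A check like `missing()` is one way such a template could flag undocumented dimensions before an evaluation plan is finalized; the actual card may organize its dimensions differently.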
