Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations

Kevin L. Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande

TL;DR

This work addresses the problem that human baselines in foundation-model evaluations are frequently underpowered, underreported, and insufficiently documented. It introduces a measurement-theory-driven framework and a comprehensive reporting checklist, validated through a two-stage process and a systematic review of $n=115$ baselines ($n=7$ model-card baselines) with publicly available materials. Key contributions include methodological recommendations, a practical checklist for transparency, and an empirical catalog of common gaps in current baselines, all aimed at making evaluation results more interpretable and useful for researchers, practitioners, and policymakers. By promoting rigorous baselines and open reporting, the paper seeks to improve the credibility and usability of human-vs-AI evaluation results in real-world AI deployments.

Abstract

In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines

Paper Structure

This paper contains 38 sections, 5 figures, 20 tables.

Figures (5)

  • Figure 1: A summary of our recommendations for robust and transparent human baselines. Full recommendations in Section \ref{sec:Framework} and full checklist in Appendix \ref{sec:Appendix_Checklist}.
  • Figure 2: Frequency of years in which reviewed evaluations were published.
  • Figure 3: Frequency of publication venues of reviewed evaluations, in descending order. "Top ML/AI conferences & journals" are: ICML, NeurIPS, ICLR, UAI, AISTATS, COLT, ALT, JMLR, TMLR, CVPR, ICCV, ACL, NAACL, EMNLP, and SIMODS.
  • Figure 4: Frequency of languages in which reviewed evaluations' items were written, in descending order. Note that individual evaluations may contain items in multiple languages.
  • Figure I: A summary of our recommendations for robust and transparent human baselines. Definitions of each stage of the baseline lifecycle are provided in Table \ref{tab:Exec_Summary_Stages}, and more details about our recommendations are provided in Table \ref{tab:Exec_Summary_Recs}. Full recommendations are in Section \ref{sec:Framework} and full checklist is in Appendix \ref{sec:Appendix_Checklist}.