EvalCards: A Framework for Standardized Evaluation Reporting
Ruchira Dhar, Danae Sanchez Villegas, Antonia Karamolegkou, Alice Schiavone, Yifei Yuan, Xinyi Chen, Jiaang Li, Stella Frank, Laura De Grazia, Monorama Swain, Stephanie Brandl, Daniel Hershcovich, Anders Søgaard, Desmond Elliott
TL;DR
The paper identifies three crises in NLP evaluation reporting (reproducibility, accessibility, and governance) and argues that existing standards fail to center evaluation. It introduces Evaluation Disclosure Cards (EvalCards), a lightweight, evaluation-focused reporting format with three design principles (easy to write, easy to understand, hard to miss) and concrete content requirements (modalities, languages, capabilities, safety, developer notes). Through case studies, it demonstrates how EvalCards can improve reproducibility and visibility while supporting governance and regulatory needs, and it covers implementation details such as release pipelines and display locations. The authors also address alternative views and outline future directions, including linking EvalCards to benchmarks and integrating them into regulatory frameworks to promote responsible AI deployment.
Abstract
Evaluation has long been a central concern in NLP, and transparent reporting practices are more critical than ever in today's landscape of rapidly released open-access models. Drawing on a survey of recent work on evaluation and documentation, we identify persistent shortcomings in current reporting practices along three dimensions: reproducibility, accessibility, and governance. We argue that existing standardization efforts remain insufficient and introduce Evaluation Disclosure Cards (EvalCards) as a path forward. EvalCards are designed to enhance transparency for both researchers and practitioners while providing a practical foundation to meet emerging governance requirements.
