EvalCards: A Framework for Standardized Evaluation Reporting

Ruchira Dhar, Danae Sanchez Villegas, Antonia Karamolegkou, Alice Schiavone, Yifei Yuan, Xinyi Chen, Jiaang Li, Stella Frank, Laura De Grazia, Monorama Swain, Stephanie Brandl, Daniel Hershcovich, Anders Søgaard, Desmond Elliott

TL;DR

The paper identifies three crises in NLP evaluation reporting—reproducibility, accessibility, and governance—and argues that existing standards fail to center evaluation. It introduces Evaluation Disclosure Cards (EvalCards) as a lightweight, evaluation-focused reporting format with design principles (easy to write, easy to understand, hard to miss) and concrete content requirements (modalities, languages, capabilities, safety, developer notes). Through case studies, it demonstrates how EvalCards can improve reproducibility and visibility while supporting governance and regulatory needs, and discusses implementation details such as release pipelines and display locations. The authors discuss alternative views and outline future directions, including linking EvalCards to benchmarks and integrating them into regulatory frameworks to promote responsible AI deployment.
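
As a rough illustration of the content requirements listed above, the following is a minimal sketch of how an EvalCard might be held as a structured record. The class name `EvalCard`, its field names, the `missing_sections` helper, and the JSON serialization are all assumptions made for this example; the paper's actual card layout is shown in its figures and may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EvalCard:
    """Hypothetical record with one field per content area named in the TL;DR."""

    model_name: str
    modalities: List[str] = field(default_factory=list)
    languages: List[str] = field(default_factory=list)
    # Benchmark name -> score; the keys and values used below are placeholders.
    capabilities: Dict[str, float] = field(default_factory=dict)
    safety: Dict[str, float] = field(default_factory=dict)
    developer_notes: str = ""

    def missing_sections(self) -> List[str]:
        """Return the content areas that are still empty, in field order."""
        return [
            name
            for name, value in asdict(self).items()
            if name != "model_name" and not value
        ]

    def to_json(self) -> str:
        """Serialize the card so it can be published next to a model release."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; these are not results from the paper.
card = EvalCard(model_name="example-model", modalities=["text"], languages=["en"])
print(card.missing_sections())
print(card.to_json())
```

A completeness check of this kind is one way a release pipeline could enforce the "hard to miss" principle, by refusing to publish a card with empty sections; that use is a suggestion here, not something the paper prescribes.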

Abstract

Evaluation has long been a central concern in NLP, and transparent reporting practices are more critical than ever in today's landscape of rapidly released open-access models. Drawing on a survey of recent work on evaluation and documentation, we identify three persistent shortcomings in current reporting practices: reproducibility, accessibility, and governance. We argue that existing standardization efforts remain insufficient and introduce Evaluation Disclosure Cards (EvalCards) as a path forward. EvalCards are designed to enhance transparency for both researchers and practitioners while providing a practical foundation to meet emerging governance requirements.

Paper Structure

This paper contains 47 sections and 4 figures.

Figures (4)

  • Figure 1: Challenges in evaluation reporting (Section \ref{sec:currentstate}) and proposed solutions via EvalCards (Section \ref{sec:evalcards}). EvalCards provide capability reporting to improve reproducibility, a standardized format for accessibility, and safety and compliance documentation for governance.
  • Figure 2: EvalCard for OLMO-2-1124-7B-Instruct.
  • Figure 3: EvalCard for Qwen3-4B-Base.
  • Figure 4: EvalCard for Gemini Flash 2.0.