ECBD: Evidence-Centered Benchmark Design for NLP

Yu Lu Liu; Su Lin Blodgett; Jackie Chi Kit Cheung; Q. Vera Liao; Alexandra Olteanu; Ziang Xiao

TL;DR

The paper addresses the lack of a principled way to analyze benchmark design decisions in NLP and their effect on the validity of a benchmark's measurements. It introduces Evidence-Centered Benchmark Design (ECBD), a framework that structures benchmark design into five modules (capability, content, adaptation, assembly, and evidence), along with a worksheet for documenting and justifying design decisions. Case studies on BoolQ, SuperGLUE, and HELM reveal recurring gaps in intended-use specification, capability conceptualization, data justification, adaptation prescriptions, assembly transparency, and validity evidence. The framework offers a practical path toward more transparent, interpretable, and valid NLP benchmarks, and suggests directions for extending benchmarking to other modalities and use contexts.

Abstract

Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
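
As an illustration of the "describe, justify, and support" requirement, the sketch below models a worksheet as a small data structure and reports which modules still lack one of those three parts. It is a minimal sketch, not the authors' worksheet: only the five module names come from the text above, while `DesignDecision`, `Worksheet`, `gaps`, and the benchmark "ToyQA" are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# The five module names are taken from the paper; everything else is hypothetical.
MODULES = ("capability", "content", "adaptation", "assembly", "evidence")
PARTS = ("description", "justification", "support")


@dataclass
class DesignDecision:
    """One design choice, recorded under a single module."""
    module: str
    description: str = ""
    justification: str = ""
    support: str = ""

    def __post_init__(self):
        if self.module not in MODULES:
            raise ValueError(f"unknown module: {self.module!r}")

    def missing_parts(self):
        """Return the parts (describe / justify / support) left empty."""
        return [p for p in PARTS if not getattr(self, p).strip()]


@dataclass
class Worksheet:
    """Hypothetical worksheet: intended use plus per-module decisions."""
    benchmark: str
    intended_use: str = ""
    decisions: list = field(default_factory=list)

    def gaps(self):
        """Map each under-documented item to the parts it is missing."""
        report = {}
        if not self.intended_use.strip():
            report["intended_use"] = ["description"]
        for module in MODULES:
            entries = [d for d in self.decisions if d.module == module]
            if not entries:
                # A module with no recorded decision is missing everything.
                report[module] = list(PARTS)
                continue
            missing = sorted({p for d in entries for p in d.missing_parts()})
            if missing:
                report[module] = missing
        return report


# Usage with a made-up benchmark: only the capability is described so far.
ws = Worksheet(benchmark="ToyQA", intended_use="Compare models on yes/no questions")
ws.decisions.append(DesignDecision("capability", description="reading comprehension"))
gaps = ws.gaps()
for item, missing in gaps.items():
    print(f"{item}: missing {', '.join(missing)}")
```

Treating an undocumented module as missing all three parts is a design choice of this sketch; it mirrors the kind of documentation gap the case studies look for, but the paper's own worksheet may organize these questions differently.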

Paper Structure (44 sections, 3 figures, 1 table)

Introduction
Background & Related Work
Benchmarking in NLP
Critiques and Meta-Analyses
NLP and ML Documentation
Measurement Theory
Evidence-Centered Design (ECD) in Education
Evidence-Centered Benchmark Design
Capability Module
Content Module
Adaptation Module
Assembly Module
Evidence Module
Evidence Extraction
Evidence Accumulation
...and 29 more sections

Figures (3)

Figure 1: Simplified schema of the Evidence-Centered Benchmark Design ($\texttt{ECBD}$) framework. Solid line arrows indicate the process of designing a benchmark (e.g., designers should decide on the intended uses of the benchmark before deciding what capabilities are of interest). The dotted line arrows indicate the process wherein the benchmark gathers necessary evidence.
Figure 2: The Evidence-Centered Benchmark Design framework. Solid line arrows indicate the process of designing a benchmark (e.g., designers decide on the intended uses of the benchmark before deciding what capabilities are of interest). The dotted line arrows indicate the process of the benchmark gathering necessary capability evidence.
Figure 3: Different levels of capabilities and their connection, in HELM and SuperGLUE. In SuperGLUE, the connection between sub-capabilities (e.g., "causal reasoning") and "general-purpose language understanding" is not explained. It is thus denoted by the dotted lines and the question mark.
