Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners

Charlotte Li, Nick Hagar, Sachita Nishal, Jeremy Gilbert, Nick Diakopoulos

TL;DR

This work addresses the challenge of ecological and construct validity in large language model benchmarks by adopting a human-centered, domain-specific approach in journalism. Through a workshop with 23 journalism professionals and a case study of an information-extraction task, the authors derive design guidelines and instantiate a modular benchmark cookbook that supports context-aware, value-driven evaluation while preserving professional judgment. Key contributions include (i) a set of design implications for values-driven metrics, task-context mapping, and modular benchmark design, and (ii) a practical notebook-based cookbook that can be adapted across newsrooms to foster ecologically valid evaluation. The work demonstrates how domain practitioners can shape benchmarking to reflect real-world use, an approach that may generalize to other professional domains amid ongoing debates about benchmark utility and validity.

Abstract

Benchmarks play a significant role in how researchers and the public understand generative AI systems. However, the widespread use of benchmark scores to communicate about model capabilities has led to criticisms of validity, especially whether benchmarks test what they claim to test (i.e. construct validity) and whether benchmark evaluations are representative of how models are used in the wild (i.e. ecological validity). In this work we explore how to create an LLM benchmark that addresses these issues by taking a human-centered approach. We focus on designing a domain-oriented benchmark for journalism practitioners, drawing on insights from a workshop of 23 journalism professionals. Our workshop findings surface specific challenges that inform benchmark design opportunities, which we instantiate in a case study that addresses underlying criticisms and specific domain concerns. Through our findings and design case study, this work provides design guidance for developing benchmarks that are better tuned to specific domains.

Paper Structure

This paper contains 28 sections, 1 figure, 2 tables.

Table of Contents

  1. Introduction
  2. Related Work
  3. The Challenges of Benchmarking
  4. Domain-Centered Benchmarking
  5. Workshopping a Benchmark for Journalism with Practitioners
  6. Workshop Organization
  7. Participants
  8. Workshop Structure
  9. Data Collection and Analysis
  10. Workshop Findings
  11. Values as Evaluation Metrics
  12. Design Implication: Values-Driven Metrics. Our findings here suggest that, when designing a benchmark for using AI in newsrooms, journalistic values reflect tendencies that can meaningfully guide evaluation metrics across different use cases. At the same time, values need to be operationalized in use-case-specific ways in order to be valid constructs for measurement.
  13. Generalizability and Specific Context
  14. Design Implication: Mapping Context. Given the importance of specific task contexts to the successful performance of journalistic tasks, benchmarks need to systematically map variations in tasks and task contexts based on domain practitioner knowledge. For instance, based on our findings, we could characterize possible dimensions of variability in the context of journalism tasks, such as audience orientation or input document type. From there, any evaluation of generative AI systems could be intentional and explicit about the specific context it is testing for, by mapping out its position along each dimension of variability. Moreover, identifying specific task contexts and their correspondence to practice would support ecological validity.
  15. Dataset Construction
  16. ...and 13 more sections
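
The context mapping described under "Design Implication: Mapping Context" can be pictured as a small data structure. The sketch below is illustrative only and is not taken from the authors' cookbook: the two dimension names (audience orientation, input document type) come from the text, while the example values, the `TaskContext` class, and the `uncovered_contexts` helper are hypothetical names introduced here. It shows one way an evaluation could state its position along each dimension and check which mapped contexts it does not yet cover.

```python
from dataclasses import dataclass
from itertools import product

# Dimension names follow the text; the listed values are assumptions
# chosen for illustration, not categories reported by the authors.
DIMENSIONS = {
    "audience_orientation": ["general_public", "specialist_readers"],
    "input_document_type": ["press_release", "court_filing", "interview_transcript"],
}


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical record of where one evaluation item sits in the context map."""

    task: str
    audience_orientation: str
    input_document_type: str

    def __post_init__(self):
        # Reject values that practitioners have not mapped for a dimension.
        for name, allowed in DIMENSIONS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not a mapped value")


def uncovered_contexts(task, covered):
    """Return the mapped contexts for `task` that no evaluation item covers yet."""
    all_contexts = {
        TaskContext(task, audience, doc_type)
        for audience, doc_type in product(
            DIMENSIONS["audience_orientation"],
            DIMENSIONS["input_document_type"],
        )
    }
    return all_contexts - set(covered)
```

Under these assumptions, a benchmark item is tagged with one `TaskContext`, and `uncovered_contexts("information_extraction", items)` lists the combinations that the evaluation makes no claim about, which keeps its scope explicit.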

Figures (1)

  • Figure 1: Google Colaboratory Notebooks are modular and allow us to interweave markdown text and code explaining rationales behind decisions in the evaluation process.