Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
Charlotte Li, Nick Hagar, Sachita Nishal, Jeremy Gilbert, Nick Diakopoulos
TL;DR
This work addresses the challenge of ecological and construct validity in large language model benchmarks by adopting a human-centered, domain-specific approach in journalism. Through a practitioner-led workshop with 23 journalists and a case study of an information-extraction task, the authors derive design guidelines and instantiate a modular benchmark cookbook that supports context-aware, value-driven evaluation while preserving professional judgment. Key contributions include (i) a set of design implications for values-driven metrics, task-context mapping, and modular benchmark design, and (ii) a practical notebook-based cookbook that can be adapted across newsrooms to foster ecologically valid evaluation. The work demonstrates how domain practitioners can shape benchmarking to reflect real-world use, and the approach may generalize to other professional domains at a time when benchmark utility and validity remain under debate.
Abstract
Benchmarks play a significant role in how researchers and the public understand generative AI systems. However, the widespread use of benchmark scores to communicate about model capabilities has led to criticisms of validity, especially whether benchmarks test what they claim to test (i.e., construct validity) and whether benchmark evaluations are representative of how models are used in the wild (i.e., ecological validity). In this work, we explore how to create an LLM benchmark that addresses these issues by taking a human-centered approach. We focus on designing a domain-oriented benchmark for journalism practitioners, drawing on insights from a workshop with 23 journalism professionals. Our workshop findings surface specific challenges that inform benchmark design opportunities, which we instantiate in a case study that addresses both the underlying validity criticisms and specific domain concerns. Through our findings and design case study, this work provides design guidance for developing benchmarks that are better tuned to specific domains.
