Benchmarks as Microscopes: A Call for Model Metrology
Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra
TL;DR
The paper argues that static LM benchmarks saturate and fail to predict real-world deployment, motivating a new discipline called model metrology that emphasizes constrained, dynamic, and plug-and-play evaluations. It advocates forming a dedicated community to bridge researchers, developers, and users, develop targeted benchmarks through constraint-based adversarial testing, and share theory-driven frameworks and tooling. Key contributions include a manifesto for metrology culture, concrete desiderata for benchmarks, and proposed pathways to unite proto-metrology communities, solicit domain constraints, and collaborate with related fields. The practical impact is to enable more reliable, domain-relevant evaluation of LM capabilities, support better deployment decisions, and foster a mature engineering discipline around measurement and auditing of AI systems.
Abstract
Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs require a new approach to benchmarking that measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and to add clarity to the AI discussion.
