Benchmarks as Microscopes: A Call for Model Metrology

Michael Saxon; Ari Holtzman; Peter West; William Yang Wang; Naomi Saphra

TL;DR

The paper argues that static LM benchmarks saturate and fail to predict real-world deployment, motivating a new discipline called model metrology that emphasizes constrained, dynamic, and plug-and-play evaluations. It advocates forming a dedicated community to bridge researchers, developers, and users, develop targeted benchmarks through constraint-based adversarial testing, and share theory-driven frameworks and tooling. Key contributions include a manifesto for metrology culture, concrete desiderata for benchmarks, and proposed pathways to unite proto-metrology communities, solicit domain constraints, and collaborate with related fields. The practical impact is to enable more reliable, domain-relevant evaluation of LM capabilities, support better deployment decisions, and foster a mature engineering discipline around measurement and auditing of AI systems.

Abstract

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and to add clarity to the AI discussion.

Paper Structure (24 sections)

Introduction
Problems with current benchmarks for LMs
Generalized capabilities are hard to define and contentious.
Benchmarks can aim for generality -- or they can be valid and useful.
We know existing benchmarks are flawed. Why do we keep using them?
Qualities of useful, concrete benchmarks
The promise of a model metrology community
A dedicated community can better connect researchers, developers, and users.
Metrologists will produce targeted dynamic benchmarks for complex problems.
Model metrologists will establish shared knowledge & techniques.
Shared framings of abstract capabilities across concrete settings.
A shift from observations to theories and science.
Quality benchmark-building tools.
Metrology culture prioritizes data work, methodological rigor, and proactive criticism.
How do we build the model metrology discipline?
...and 9 more sections
