Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees
Craig Greenberg, Patrick Hall, Theodore Jensen, Kristen Greene, Razvan Amironesei
TL;DR
This work introduces measurement trees as a principled, interpretable framework for evaluating complex AI systems by encoding a measurand as a hierarchical graph where leaves are data points and each node carries a summary function over its descendants. It formalizes the model with $H$ (hierarchical clustering) and $\mathfrak{F}$ (node-wise summaries), yielding a measurand $\mathcal{M} = (H,\mathfrak{F})$ that can integrate heterogeneous signals and support multi-level interpretation. The paper provides illustrative examples, including HELM-based benchmarking, and demonstrates a large-scale CoRIx use case that combines benchmarking, red teaming, and field testing to produce context-sensitive validity and reliability scores. It discusses strengths such as transparency and resistance to certain measurement biases, along with limitations like novelty, the need for domain expertise, and the current lack of uncertainty propagation in the baseline framework, while outlining future directions and providing open-source tooling. Collectively, measurement trees offer a scalable, transparent foundation for broader and more interpretable AI evaluation across sociotechnical contexts.
Abstract
This paper introduces \textit{measurement trees}, a novel class of metrics designed to combine various constructs into an interpretable multi-level representation of a measurand. Unlike conventional metrics that yield single values, vectors, surfaces, or categories, measurement trees produce a hierarchical directed graph in which each node summarizes its children through user-defined aggregation methods. In response to recent calls to expand the scope of AI system evaluation, measurement trees enhance metric transparency and facilitate the integration of heterogeneous evidence, such as agentic, business, energy-efficiency, sociotechnical, or security signals. We present definitions and examples, demonstrate practical utility through a large-scale measurement exercise, and provide accompanying open-source Python code. By operationalizing a transparent approach to the measurement of complex constructs, this work offers a principled foundation for broader and more interpretable AI evaluation.
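To make the structure concrete, the following is a minimal Python sketch of a measurement tree in the sense described above: leaves hold data points, and each internal node applies a user-defined summary function to the values of its children. It is an illustration under stated assumptions, not the authors' released code. The names `Node`, `summarize`, and `evaluate`, the example scores, and the choice of `mean` and `min` as aggregations are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class Node:
    """One node of a hypothetical measurement tree.

    A leaf stores an observed data point in `value`.
    An internal node stores a summary function in `summarize`
    and fills in `value` when the tree is evaluated.
    """
    name: str
    value: Optional[float] = None
    summarize: Optional[Callable[[List[float]], float]] = None
    children: List["Node"] = field(default_factory=list)

    def evaluate(self) -> float:
        # Leaf: return the stored data point.
        if not self.children:
            if self.value is None:
                raise ValueError(f"leaf '{self.name}' has no value")
            return self.value
        # Internal node: summarize the evaluated children.
        if self.summarize is None:
            raise ValueError(f"node '{self.name}' has no summary function")
        self.value = self.summarize([c.evaluate() for c in self.children])
        return self.value


# Assumed example data: two constructs, each measured by a few leaf scores.
accuracy = Node(
    "accuracy",
    summarize=mean,
    children=[Node("q1", 0.5), Node("q2", 0.75), Node("q3", 1.0)],
)
robustness = Node(
    "robustness",
    summarize=mean,
    children=[Node("attack1", 0.25), Node("attack2", 0.5)],
)

# The root takes the weakest construct score (an assumed design choice).
root = Node("overall", summarize=min, children=[accuracy, robustness])
overall_score = root.evaluate()

# Every level stays inspectable after evaluation.
for node in (root, accuracy, robustness):
    print(node.name, node.value)
```

Because each internal node keeps its own summarized value, a reader can inspect the root score and then trace it down to the construct and data points that produced it, which is the multi-level interpretation the tree is meant to support. Swapping `min` for a weighted mean at the root changes the measurand without touching the leaves.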
