Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees

Craig Greenberg, Patrick Hall, Theodore Jensen, Kristen Greene, Razvan Amironesei

TL;DR

This work introduces measurement trees as a principled, interpretable framework for evaluating complex AI systems by encoding a measurand as a hierarchical graph where leaves are data points and each node carries a summary function over its descendants. It formalizes the model with $H$ (hierarchical clustering) and $\mathfrak{F}$ (node-wise summaries), yielding a measurand $\mathcal{M} = (H,\mathfrak{F})$ that can integrate heterogeneous signals and support multi-level interpretation. The paper provides illustrative examples, including HELM-based benchmarking, and demonstrates a large-scale CoRIx use case that combines benchmarking, red teaming, and field testing to produce context-sensitive validity and reliability scores. It discusses strengths such as transparency and resistance to certain measurement biases, along with limitations like novelty, the need for domain expertise, and the current lack of uncertainty propagation in the baseline framework, while outlining future directions and providing open-source tooling. Collectively, measurement trees offer a scalable, transparent foundation for broader and more interpretable AI evaluation across sociotechnical contexts.

Abstract

This paper introduces measurement trees, a novel class of metrics designed to combine various constructs into an interpretable multi-level representation of a measurand. Unlike conventional metrics that yield single values, vectors, surfaces, or categories, measurement trees produce a hierarchical directed graph in which each node summarizes its children through user-defined aggregation methods. In response to recent calls to expand the scope of AI system evaluation, measurement trees enhance metric transparency and facilitate the integration of heterogeneous evidence, such as agentic, business, energy-efficiency, sociotechnical, or security signals. We present definitions and examples, demonstrate practical utility through a large-scale measurement exercise, and provide accompanying open-source Python code. By operationalizing a transparent approach to the measurement of complex constructs, this work offers a principled foundation for broader and more interpretable AI evaluation.
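To make the idea of a node summarizing its children concrete, here is a minimal Python sketch. It is not the paper's open-source implementation: the `Node` class, its field names, the `evaluate` and `report` methods, and the toy construct names and values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical measurement-tree node (not the paper's API)."""
    name: str
    value: Optional[float] = None                      # set only on leaves (data points)
    summary: Callable[[Sequence[float]], float] = mean  # user-defined aggregation
    children: List["Node"] = field(default_factory=list)

    def evaluate(self) -> float:
        """Return the leaf value, or the summary of the children's values."""
        if not self.children:
            return self.value
        return self.summary([child.evaluate() for child in self.children])

    def report(self) -> dict:
        """Expose every level of the tree, not just the root value."""
        return {
            "name": self.name,
            "value": self.evaluate(),
            "children": [child.report() for child in self.children],
        }


def leaves(prefix: str, values: Sequence[float]) -> List[Node]:
    """Wrap raw data points as leaf nodes."""
    return [Node(f"{prefix}_{i}", value=v) for i, v in enumerate(values)]


# Assumed toy data: two constructs with different summary functions.
qa = Node("question_answering", summary=mean, children=leaves("qa", [1.0, 0.0, 1.0, 1.0]))
safety = Node("safety", summary=min, children=leaves("safety", [0.8, 0.6]))
root = Node("overall", summary=mean, children=[qa, safety])

tree_report = root.report()
```

Calling `root.evaluate()` yields a single top-level value, while `root.report()` keeps the intermediate construct values visible, which is the multi-level interpretation the abstract describes. Using `min` for one construct and `mean` for another illustrates that each node may carry its own aggregation method.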

Paper Structure

This paper contains 25 sections, 4 theorems, 1 equation, 8 figures, 2 tables.

Key Result

Lemma 1

(Ordering of Composed Functions) Given a series of functions $f_1,\ldots,f_n$ such that the range of each function $f_i$ is the domain of the function $f_{i+1}$ and every $f_i$ induces an ordering, the composition of functions, $f_n(\ldots(f_1(x)))$, induces an ordering.
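One way to read the lemma, assuming that "induces an ordering" means inputs can be ranked by comparing their function values, is sketched below in Python. The `compose` helper, the two summary functions, and the system names and outcomes are hypothetical and serve only as an illustration of that reading.

```python
from functools import reduce
from statistics import mean


def compose(*funcs):
    """Return x -> f_n(...f_1(x)) for funcs = (f_1, ..., f_n)."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


# f_1 maps a list of 0/1 outcomes to an accuracy; f_2 maps an accuracy to a
# whole-number percentage. The range of f_1 is the domain of f_2.
f1 = mean
f2 = lambda accuracy: round(100 * accuracy)
score = compose(f1, f2)

# Assumed toy outcomes for three hypothetical systems.
systems = {"A": [1, 1, 0, 1], "B": [1, 0, 0, 0], "C": [1, 1, 1, 1]}

# The composed function ranks the original inputs via its output values.
ranking = sorted(systems, key=lambda name: score(systems[name]))
```

Here each system's raw outcomes are ordered through the composed score, even though neither the outcome lists nor the intermediate accuracies are compared directly.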

Figures (8)

  • Figure 1: Illustration of example data points aggregated into various numbers of constructs and higher-level constructs.
  • Figure 2: Illustration of different ways of aggregating data points into constructs, which are represented with tree edges.
  • Figure 3: Illustration of aggregating example data points into higher level constructs using descriptive statistics as summary functions.
  • Figure 4: Llama 2 (70B) accuracy metric values from the HELM benchmark represented as a measurement tree with subconstructs aligned to HELM Core Scenarios. Mean win rate (MWR) is the summarization function for accuracy (the measurand) as well as the question answering, sentiment analysis, text classification, and toxicity classification subconstructs. Exact match (EM) and F1 metrics are used in lower-level nodes. For additional information see: https://crfm.stanford.edu/helm/classic/latest/.
  • Figure 5: An illustration of using a measurement tree to aggregate example data points into a common quality metric, accuracy.
  • ...and 3 more figures

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Definition 6
  • Theorem 2