Quantifying construct validity in large language model evaluations

Ryan Othniel Kearns

TL;DR

This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results, and demonstrates better explanatory and predictive power for quantifying construct validity in LLM evaluations.

Abstract

The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Social scientists and computer scientists propose formal models (latent factor models and scaling laws, respectively) for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.

Paper Structure

This paper contains 35 sections, 26 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: High-level picture of my contribution. My proposed model, the structured capabilities model, addresses problems in both latent factor and observational scaling law approaches. These existing approaches do not sufficiently separate model scale from capability estimates, which harms both their explanatory and their predictive power.
  • Figure 2: The scale of language model training, 2010 to 2025. The dashed black line shows the introduction of the Transformer in 2017. The solid red lines show trends for training compute before the Transformer (order of magnitude increase every 5 years) and after the Transformer (order of magnitude increase every 1.2 years). Influential LLMs are annotated on the plot with red points. Data from Epoch AI (EpochAIModels2025).
  • Figure 3: Illustrated three-parameter item response curve. This curve represents the three-parameter logistic item response model (\ref{eq:3pl_irt}). "Difficulty" ($\beta = 0$) represents how difficult an item is relative to others by effecting a lateral translation of $p(\theta)$ across all ability scores. "Discrimination" ($\alpha = 1$) represents the item's ability to differentiate a narrow range of ability scores by affecting the slope of the sigmoidal curve, $\frac{d}{d\theta} p(\theta)$. "Guessing probability" ($c = 1/5 = 0.2$) represents the rate of success at the lower asymptote, $\lim_{\theta\to -\infty}p(\theta)$, which for multiple-choice questions will be the probability of guessing correctly from $1/c$ options.
  • Figure 4: Raw BBH score correlation matrix. The correlation metric is Spearman's rank correlation $\rho$. The labels along the $y$-axis are abbreviations of the full subtask names, which are given along the $x$-axis.
  • Figure 5: Logistic scaling law fits for two subtasks. Ruin names has the highest observed goodness of fit to the logistic curve ($R^2 = 0.52$); Web of lies has the lowest ($R^2 = 0.12$).
  • ...and 8 more figures
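
The item response curve described in the Figure 3 caption can be sketched in a few lines. The thesis's own equation is not reproduced on this page, so the sketch below assumes the standard three-parameter logistic form, $p(\theta) = c + (1 - c) / (1 + e^{-\alpha(\theta - \beta)})$. The function names `three_pl` and `slope_at_difficulty` are illustrative and do not come from the thesis; the default parameter values are the ones quoted in the caption.

```python
import math


def three_pl(theta, alpha=1.0, beta=0.0, c=0.2):
    """Probability of a correct response at ability `theta`.

    Assumes the standard 3PL parameterisation (an assumption, not
    necessarily the thesis's exact equation):
      alpha -- discrimination, scales the slope of the curve
      beta  -- difficulty, shifts the curve along the ability axis
      c     -- guessing probability, the lower asymptote
    """
    return c + (1.0 - c) / (1.0 + math.exp(-alpha * (theta - beta)))


def slope_at_difficulty(alpha=1.0, c=0.2):
    """Slope dp/dtheta evaluated at theta == beta.

    Differentiating the 3PL form gives alpha * (1 - c) * s * (1 - s),
    where s is the logistic term; at theta == beta, s = 1/2.
    """
    return alpha * (1.0 - c) / 4.0


if __name__ == "__main__":
    # Far below the item's difficulty, success approaches the guessing rate c.
    # At theta == beta, success is halfway between c and 1.
    for theta in (-50.0, -2.0, 0.0, 2.0, 50.0):
        print(f"theta={theta:6.1f}  p={three_pl(theta):.4f}")
```

With the caption's values ($\alpha = 1$, $\beta = 0$, $c = 0.2$), the curve is approximately 0.6 at $\theta = 0$ and approaches 0.2 only as $\theta$ becomes very negative, which is why the guessing probability is read off the lower asymptote rather than at $\theta = 0$.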