The Subjectivity of Monoculture

Nathanael Jo, Nikhil Garg, Manish Raghavan

TL;DR

The results reframe monoculture evaluation as a context-dependent inference problem rather than an absolute property of model behavior: conclusions about excess agreement depend on the chosen null model and on the population of models and items under consideration.

Abstract

Machine learning models -- including large language models (LLMs) -- are often said to exhibit monoculture, where outputs agree strikingly often. But what does it actually mean for models to agree too much? We argue that this question is inherently subjective, relying on two key decisions. First, the analyst must specify a baseline null model for what "independence" should look like. This choice is inherently subjective, and as we show, different null models result in dramatically different inferences about excess agreement. Second, we show that inferences depend on the population of models and items under consideration. Models that seem highly correlated in one context may appear independent when evaluated on a different set of questions, or against a different set of peers. Experiments on two large-scale benchmarks validate our theoretical findings. For example, we find drastically different inferences when using a null model with item difficulty compared to previous works that do not. Together, our results reframe monoculture evaluation not as an absolute property of model behavior, but as a context-dependent inference problem.

Paper Structure (52 sections, 7 theorems, 85 equations, 6 figures)

Key Result

Theorem 1

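For any probability distribution $P$ on $\{0,1\}^m$, there exists a probability measure $H$ on $[0,1]^m$ and a latent vector $P_i=(P_{i1},\ldots,P_{im})\sim H$ such that, conditional on $P_i$, the coordinates of $Y_i=(Y_{i1},\ldots,Y_{im})$ are independent Bernoulli:

$$Y_{ij}\mid P_i \sim \mathrm{Bernoulli}(P_{ij}) \quad \text{independently for } j=1,\ldots,m,$$

and the unconditional distribution of $Y_i$ equals $P$:

$$\Pr(Y_i=y)=\int_{[0,1]^m}\prod_{j=1}^{m} p_j^{y_j}(1-p_j)^{1-y_j}\,dH(p)=P(y) \quad \text{for all } y\in\{0,1\}^m.$$

Moreover, there exists a discrete $H$ supported on at most $2^m$ points with this property.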
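One way to see why a discrete $H$ with at most $2^m$ atoms suffices in Theorem 1 is the degenerate construction that places mass $P(y)$ on each vertex $y\in\{0,1\}^m$ of the cube. The sketch below checks this numerically for a randomly drawn $P$. The function names `vertex_mixing_measure` and `mixture_pmf` are hypothetical, and this is only one construction consistent with the statement; the paper's own proof may proceed differently.

```python
import itertools
import numpy as np

def vertex_mixing_measure(P):
    """Build a discrete H from P (dict: binary tuple -> probability).

    Each atom of H is a vertex of [0,1]^m, weighted by P at that vertex.
    Hypothetical helper for illustration only.
    """
    support = np.array(list(P.keys()), dtype=float)   # shape (n_atoms, m)
    weights = np.array(list(P.values()), dtype=float)  # shape (n_atoms,)
    return support, weights

def mixture_pmf(support, weights, y):
    """Pr(Y = y) when p ~ H and coordinates are independent Bernoulli(p_j)."""
    y = np.asarray(y, dtype=float)
    # Per-atom likelihood: prod_j p_j^{y_j} (1 - p_j)^{1 - y_j}
    lik = np.prod(np.where(y == 1, support, 1.0 - support), axis=1)
    return float(weights @ lik)

# Draw an arbitrary distribution P on {0,1}^3 (illustrative data, not from the paper).
rng = np.random.default_rng(0)
m = 3
outcomes = list(itertools.product([0, 1], repeat=m))
P = dict(zip(outcomes, rng.dirichlet(np.ones(len(outcomes)))))

support, weights = vertex_mixing_measure(P)
max_err = max(abs(mixture_pmf(support, weights, y) - P[y]) for y in outcomes)
print(f"atoms: {len(support)}, max |mixture - P|: {max_err:.2e}")
```

Because every atom is a vertex, the conditional Bernoulli draws are deterministic, so all dependence in $P$ is carried by the latent vector. This is the sense in which a sufficiently flexible conditionally independent null can reproduce any joint distribution.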

Figures (6)

  • Figure 1: Mean square error (MSE) and absolute mean of the pairwise residual correlations, as a function of $K$: dimensions in the multidimensional IRT model. Left (right) shows results on the HELM (HF) dataset. As $K$ increases, residual correlations that are unexplained by the null model tend toward zero, meaning that increasingly expressive null models can arbitrarily absorb model correlations.
  • Figure 2: Residual correlation matrices for models in HELM (a) and HF (b), using different nulls: from left to right, a baseline from kimcorrelated2025, a baseline from goelgreat2025, a 1D IRT with no item difficulty, and a 1D IRT with item difficulties. Each item is a question-answer choice pair. The first three null models do not include item heterogeneity. The excess correlation under the full IRT model is attenuated relative to these three because item difficulties absorb much of the apparent positive correlation.
  • Figure 3: (a) Scatter plot of inferred model ability $\Theta$ in a two-dimensional IRT model, colored by accuracy over entire benchmark dataset (HELM). (b; top middle) Same as (a), but for HF data. (b; top right) Percentage of variance explained by principal component, when applying PCA on model accuracy stratified by question type. (b; bottom) Scatter plot of inferred ability $\Theta$, disaggregated by contributor.
  • Figure 4: [Top] Inferred excess correlation matrix $\hat{\Sigma}$ from the two-stage procedure, on HELM (a) and ACSIncome (b) data. For HELM, the correlation matrix is shown for OpenAI models only. From left to right, more models are injected into the population for inference, starting from only OpenAI models to all models. For ACSIncome, the correlation matrix is shown for random forest models (RF) only, following a similar pattern as HELM. [Bottom] Histogram of inferred question difficulties $d$ for all questions in the dataset.
  • Figure 5: Mean square error (MSE) and summary statistics of the distribution of pairwise residual correlations, as a function of $K$: the dimensions in the multidimensional IRT model. Top (bottom) row shows results on the HELM (HF) dataset. As $K$ increases, the residual correlations that are unexplained by the independent null model tend toward zero, meaning that a sufficiently expressive null model can arbitrarily absorb model correlations.
  • ...and 1 more figure
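The effect described in the Figure 2 caption, where item difficulties absorb apparent positive correlation, can be illustrated with a small simulation. The sketch below is a toy setup and not the paper's procedure: responses are generated independently across models given a Rasch-style (1D IRT) probability, and residual correlations are compared under a null with no item heterogeneity and under a difficulty-aware null. All parameter values are hypothetical, and the difficulty-aware null uses the true generating probabilities as an oracle instead of a fitted IRT model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 6, 20000  # hypothetical sizes

theta = rng.normal(0.0, 0.5, size=n_models)  # hypothetical model abilities
d = rng.normal(0.0, 2.0, size=n_items)       # hypothetical item difficulties

# Rasch-style success probabilities; responses are drawn independently
# across models once ability and difficulty are fixed.
p_true = 1.0 / (1.0 + np.exp(-(theta[:, None] - d[None, :])))
Y = (rng.random((n_models, n_items)) < p_true).astype(float)

def mean_offdiag_corr(resid):
    """Average pairwise correlation between rows (models) of a residual matrix."""
    C = np.corrcoef(resid)
    off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
    return float(off_diagonal.mean())

# Null A: each model has one accuracy for every item (no item heterogeneity).
resid_no_difficulty = Y - Y.mean(axis=1, keepdims=True)

# Null B: difficulty-aware null, using oracle probabilities for illustration.
resid_with_difficulty = Y - p_true

corr_no_difficulty = mean_offdiag_corr(resid_no_difficulty)
corr_with_difficulty = mean_offdiag_corr(resid_with_difficulty)

print(f"mean residual correlation, no item difficulty:   {corr_no_difficulty:.3f}")
print(f"mean residual correlation, with item difficulty: {corr_with_difficulty:.3f}")
```

In this toy setting the models are independent by construction, yet the null without item difficulty reports clearly positive residual correlation because all models tend to succeed on easy items and fail on hard ones. Under the difficulty-aware null the residual correlation is close to zero.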

Theorems & Definitions (17)

  • Theorem 1
  • Definition 1: Null ladder
  • Definition 2: Excess at level $K$
  • Proposition 2
  • Theorem 3
  • Definition 3: K-dimensional IRT
  • Definition 4: Population-specific target
  • Proposition 4: Population relativity of the null fit
  • Theorem 5
  • Lemma 6: Nestedness of ability nulls
  • ...and 7 more