Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics

Cristian Espinal Maya

Abstract

This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables -- specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI-related occupational measures span two distinct dimensions -- augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.

Paper Structure

This paper contains 23 sections, 2 theorems, 2 equations, 7 figures, 1 table.

Key Result

Proposition 1

If the four validity conditions (semantic exogeneity, construct relevance, monotonicity, and model invariance) hold, a systematic level bias between models ($E[\hat{H}^A_o] = E[\hat{H}^B_o] + \delta$ for constant $\delta$) does not affect regression coefficients when scores are standardized to zero mean and unit variance. $\blacktriangleleft$
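The logic of Proposition 1 can be checked in a few lines of simulation. This is an illustrative sketch, not the paper's code: the latent score, the outcome equation, and the noise scales below are all invented for the demonstration. Two "models" score the same latent variable, one with a constant level shift $\delta$; after standardization, the regression slopes coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
h = rng.normal(size=n)             # latent augmentation score (hypothetical)
y = 2.0 * h + rng.normal(size=n)   # outcome driven by the latent score

# Two models score the same tasks; model B carries a constant level bias delta.
delta = 8.6
score_a = h + rng.normal(scale=0.1, size=n)
score_b = h + delta + rng.normal(scale=0.1, size=n)

def standardize(x):
    return (x - x.mean()) / x.std()

def ols_slope(x, y):
    return np.polyfit(x, y, 1)[0]

slope_a = ols_slope(standardize(score_a), y)
slope_b = ols_slope(standardize(score_b), y)
print(slope_a, slope_b)  # nearly identical: the constant shift is removed
```

Standardizing subtracts each model's own mean, so any constant $\delta$ drops out exactly; only the (identical) scale of the signal and noise affects the slope.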

Figures (7)

  • Figure 1: PCA biplot of 11 AI exposure indices. All indices load positively on PC1 (general cognitive AI relevance). PC2 separates augmentation measures (negative loading: AHC, Felten) from substitution measures (positive loading: SUB scores, Eloundou $\alpha$).
  • Figure 2: Pairwise Pearson correlations across 11 AI exposure indices at the 6-digit SOC level ($n = 207$). AHC (Haiku) correlates most strongly with Eloundou $\gamma$ ($r = 0.85$) and Felten AIOE ($r = 0.79$), confirming convergent validity.
  • Figure 3: Bland--Altman agreement plot for Haiku vs. Sonnet augmentation scores ($n = 3{,}666$). The mean bias ($+$8.6 points, red line) is constant across the score range, consistent with a systematic level shift rather than scale-dependent disagreement.
  • Figure 4: Three-way inter-model agreement on augmentation scores. Left: Haiku vs. Sonnet ($r = 0.77$, within-family). Center: Haiku vs. GPT-4o-mini ($r = 0.41$, cross-family). Right: Sonnet vs. GPT-4o-mini ($r = 0.31$). Dashed red: fitted line. Dotted gray: 45-degree identity.
  • Figure 5: Prompt sensitivity analysis. Left: Spearman rank correlations across four prompt variants. Right: absolute scale varies by prompt framing, but rankings are preserved.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition 1: Level Bias Irrelevance
  • Proposition 2: ORIV Correction
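The ORIV correction in Proposition 2 can likewise be illustrated with a toy simulation (again a sketch under assumed classical measurement error, not the paper's estimation code; the noise scales and true coefficient below are made up). With two independent noisy measures of the same latent variable, e.g. scores from two LLM raters, OLS on either measure is attenuated toward zero, while instrumenting each measure with the other recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
h = rng.normal(size=n)               # latent score (hypothetical)
y = 1.0 * h + rng.normal(size=n)     # true coefficient = 1

# Two independent noisy measurements of h (e.g. two LLM raters)
x1 = h + rng.normal(scale=0.7, size=n)
x2 = h + rng.normal(scale=0.7, size=n)

def ols_slope(x, y):
    return np.polyfit(x, y, 1)[0]

def iv_slope(x, z, y):
    # Simple just-identified IV estimator: cov(z, y) / cov(z, x)
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

b_ols = ols_slope(x1, y)  # attenuated by var(h) / (var(h) + var(noise))
# ORIV idea: use each measure as an instrument for the other, then average
b_oriv = 0.5 * (iv_slope(x1, x2, y) + iv_slope(x2, x1, y))
print(b_ols, b_oriv)  # b_oriv is closer to the true value of 1
```

Because the two measurement errors are independent of each other and of the outcome, the cross-measure covariance isolates the latent signal, which is why the IV slope undoes the attenuation that OLS suffers.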