What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

Andrew Halterman, Katherine A. Keith

TL;DR

The paper argues that, in the era of generative LLMs, conceptualization via codebooks remains a first-order concern for text classification in computational social science (CSS). It contrasts three analyst archetypes (pessimist, optimist, and pragmatist) and shows that incomplete or surface-form codebooks introduce bias that neither post-hoc correction methods nor stronger LLMs can fully remove. In simulations grounded in a protest-detection example, complete codebooks yield unbiased, low-variance downstream estimates when combined with prediction-powered inference (PPI) and a limited amount of gold-standard supervision, whereas incomplete codebooks produce persistent bias. The authors propose a pragmatic workflow: write complete codebooks, incorporate expert input early in the design, and use LLMs to reduce cost and variance without sacrificing testability.

Abstract

Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting -- conceptualization of concepts to be classified and using LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected for solely by increasing LLM accuracy or by applying post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.

Paper Structure

This paper contains 29 sections, 11 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Text analysis with codebooks and LLMs. Conceptualization transforms a background concept (e.g., "a protest") to a systematized concept written in a codebook. In the current LLM-era, operationalization consists of an LLM which takes as input the codebook and documents to make predictions. Predictions are then used in downstream estimates of, e.g., the mean (prevalence) or correlation (regression) with other variables.
  • Figure 2: Different stipulative definitions of Protest from real-world codebooks. We manually categorize aspects of protest definitions from the codebooks of ACE (Doddington et al., 2004), ACLED (Raleigh et al., 2010), CAMEO (Gerner et al., 2002), and the Crowd Counting Consortium (CCC, 2024). The length of the definitions also varies: from around 40 (white-space) tokens in ACE, to around 100 for CCC and ACLED, to over 700 for CAMEO. See full definitions in the appendix.
  • Figure 3: DSL-corrected estimates from simulated data (see the simulation section); we display the mean estimate (dot) and 95% empirical intervals (bars) across $250$ simulations. $N=10,000$ per simulation, and the dashed line marks the true effect. Takeaway: Decreasing operationalization error reduces variance, but use of an incomplete codebook always results in biased estimates.
  • Figure A1: Example entry for the CAMEO codebook, illustrating the hierarchical definition of Protest. The "protest" class includes "protest violently, riot" as a subclass, which has a further subclass of "engage in violent protest for leadership change".
  • Figure A2: Simulation results comparing four estimation strategies under "complete" and "incomplete" codebook conditions, over 50 simulations with $N = 10,000$ documents and a regression model in which peaceful protests have a positive effect and violent protests have a negative effect. Results in blue (repeated in each panel) show estimates from expert annotation of all $N$ documents. The "Small N" and "DSL" conditions use expert annotations for 10% of documents, and "LLM" uses labels for all $N$ documents with 10% random error. The "incomplete" codebook omits the instruction to exclude violent protests.
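
The contrast described in the Figure A2 caption can be illustrated with a minimal simulation sketch. It is not the authors' simulation: it swaps the regression for a simpler prevalence (mean) estimate and swaps DSL for a PPI-style mean correction. Every name and parameter value below (simulate_once, run_simulations, the 0.7/0.2/0.1 document mix, the target prevalence of 0.2) is an assumption made for illustration. The sketch also assumes that expert annotators follow whichever codebook they are given, so gold labels produced under the incomplete codebook count violent protests too.

```python
import numpy as np

# Hypothetical target: the analyst's concept is "peaceful protest",
# which makes up 20% of the simulated documents.
TARGET_PREVALENCE = 0.2


def simulate_once(rng, codebook, n_docs=10_000, frac_gold=0.10, llm_error=0.10):
    """Return (naive LLM estimate, PPI-style estimate) of protest prevalence."""
    # Latent document types: 0 = no protest, 1 = peaceful, 2 = violent.
    doc_type = rng.choice(3, size=n_docs, p=[0.7, 0.2, 0.1])

    # Label that a careful annotator would assign when following the codebook.
    if codebook == "complete":
        # The complete codebook tells annotators to exclude violent protests.
        codebook_label = doc_type == 1
    elif codebook == "incomplete":
        # The incomplete codebook omits that instruction.
        codebook_label = doc_type >= 1
    else:
        raise ValueError(f"unknown codebook: {codebook!r}")
    codebook_label = codebook_label.astype(float)

    # LLM predictions: the codebook label, flipped with probability llm_error.
    flip = rng.random(n_docs) < llm_error
    llm_label = np.where(flip, 1.0 - codebook_label, codebook_label)

    # Experts annotate a random subset, using the same codebook (assumption).
    n_gold = int(frac_gold * n_docs)
    gold_idx = rng.choice(n_docs, size=n_gold, replace=False)

    # Naive estimate: mean of the LLM labels over all documents.
    naive = llm_label.mean()

    # PPI-style correction: add the mean (gold - LLM) gap on the gold subset.
    rectifier = (codebook_label[gold_idx] - llm_label[gold_idx]).mean()
    return naive, naive + rectifier


def run_simulations(codebook, n_sims=50, seed=0):
    """Repeat the simulation; rows are runs, columns are (naive, corrected)."""
    rng = np.random.default_rng(seed)
    return np.array([simulate_once(rng, codebook) for _ in range(n_sims)])


if __name__ == "__main__":
    for codebook in ("complete", "incomplete"):
        naive_mean, corrected_mean = run_simulations(codebook).mean(axis=0)
        print(
            f"{codebook:>10}: naive={naive_mean:.3f} "
            f"corrected={corrected_mean:.3f} target={TARGET_PREVALENCE}"
        )
```

Under these assumptions, the corrected estimate centers on the target prevalence of 0.2 with the complete codebook but on roughly 0.3 with the incomplete one, because the gold labels themselves encode the wrong concept. Lowering llm_error narrows the spread of the estimates without moving that center.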