What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification
Andrew Halterman, Katherine A. Keith
TL;DR
The paper argues that in the era of generative LLMs, conceptualization via codebooks remains a first-order concern for text classification in computational social science (CSS). It contrasts three analyst archetypes (pessimist, optimist, and pragmatist) and demonstrates how incomplete or surface-form codebooks introduce bias that neither post-hoc correction methods nor stronger LLMs can fully remove. Through simulations grounded in a protest-detection example, it shows that complete codebooks enable unbiased, low-variance downstream estimates when combined with prediction-powered inference (PPI) and limited gold-standard supervision, while incomplete codebooks produce persistent bias. The authors propose a pragmatic workflow: use complete codebooks, incorporate expert input early in the design, and leverage LLMs to reduce cost and variance without sacrificing testability.
Abstract
Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting -- conceptualizing the concepts to be classified and using LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim that LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected solely by increasing LLM accuracy or by applying post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.
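To make the downstream-inference step concrete, the following is a minimal sketch of the standard prediction-powered inference (PPI) point estimate of a mean, not the authors' own simulation code. It assumes a small gold-standard sample coded under a complete codebook and a large corpus with LLM predictions only. The function name `ppi_mean`, the helper `simulate`, the 30% protest prevalence, and the 60% LLM recall are all hypothetical values chosen for illustration.

```python
import numpy as np


def ppi_mean(y_gold, yhat_gold, yhat_unlabeled):
    """PPI point estimate of a mean (illustrative sketch).

    The LLM-only estimate on the large corpus is shifted by a "rectifier":
    the average gap between gold labels and LLM predictions on the small
    gold-standard sample.
    """
    y_gold = np.asarray(y_gold, dtype=float)
    yhat_gold = np.asarray(yhat_gold, dtype=float)
    yhat_unlabeled = np.asarray(yhat_unlabeled, dtype=float)
    llm_only = yhat_unlabeled.mean()
    rectifier = (y_gold - yhat_gold).mean()
    return llm_only + rectifier


def simulate(n, rng, prevalence=0.3, recall=0.6):
    """Hypothetical data: true protest labels and an LLM that misses some protests."""
    y = (rng.random(n) < prevalence).astype(int)
    # Assumed error pattern: the LLM never raises false alarms but only
    # detects a true protest with probability `recall`.
    yhat = y * (rng.random(n) < recall).astype(int)
    return y, yhat


rng = np.random.default_rng(0)
y_gold, yhat_gold = simulate(500, rng)        # small gold-standard sample
_, yhat_unlabeled = simulate(20_000, rng)     # large corpus, LLM labels only

llm_only_estimate = yhat_unlabeled.mean()
ppi_estimate = ppi_mean(y_gold, yhat_gold, yhat_unlabeled)
print(f"LLM-only estimate: {llm_only_estimate:.3f}")
print(f"PPI estimate:      {ppi_estimate:.3f}")
```

Under these assumptions the LLM-only estimate undercounts protests, while the rectifier moves the PPI estimate back toward the prevalence in the gold labels. The correction is only as good as those gold labels: if they were coded under an incomplete codebook, the rectifier would inherit the same conceptualization error, which is the kind of bias the abstract says post-hoc methods cannot remove on their own.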
