Missing the Margins: A Systematic Literature Review on the Demographic Representativeness of LLMs
Indira Sen, Marlene Lutz, Elisa Rogers, David Garcia, Markus Strohmaier
TL;DR
This systematic literature review examines how the demographic representativeness of LLMs is defined, measured, and reported across 211 empirical studies. It finds widespread underreporting of target populations and demographic subcategories, a strong U.S. focus in the analyses, and a mix of positive, partial, and negative conclusions about representativeness, with many positive claims lacking disaggregated analyses. The authors argue that this inflates perceived representativeness and propose concrete benchmarks, explicit population definitions, and demographically disaggregated evaluation to improve reliability and social responsibility in LLM deployment. The work provides a publicly available annotated corpus and code to enable reproducibility and future meta-analyses.
Abstract
Many applications of Large Language Models (LLMs) require them to either simulate people or offer personalized functionality, making the demographic representativeness of LLMs crucial for equitable utility. At the same time, we know little about the extent to which these models actually reflect the demographic attributes and behaviors of certain groups or populations, with conflicting findings in empirical research. To shed light on this debate, we review 211 papers on the demographic representativeness of LLMs. We find that while 29% of the studies report positive conclusions on the representativeness of LLMs, 30% of these do not evaluate LLMs across multiple demographic categories or within demographic subcategories. Another 35% and 47% of the papers concluding positively fail to specify these subcategories altogether for gender and race, respectively. Of the articles that do report subcategories, fewer than half include marginalized groups in their study. Finally, more than a third of the papers do not define the target population to whom their findings apply; of those that do define it either implicitly or explicitly, a large majority study only the U.S. Taken together, our findings suggest an inflated perception of LLM representativeness in the broader community. We recommend more precise evaluation methods and comprehensive documentation of demographic attributes to ensure the responsible use of LLMs for social applications. Our annotated list of papers and analysis code are publicly available.
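To illustrate why evaluation within demographic subcategories matters, the following minimal Python sketch compares an aggregate agreement score with per-subgroup scores. It is not the authors' analysis code. The records, the subgroup labels (`group_a`, `group_b`), and the helper functions `agreement` and `disaggregated_agreement` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data, not taken from any reviewed study:
# each record is (subgroup, human_response, model_response).
records = [
    ("group_a", "agree", "agree"),
    ("group_a", "agree", "agree"),
    ("group_a", "disagree", "disagree"),
    ("group_a", "agree", "agree"),
    ("group_a", "disagree", "disagree"),
    ("group_a", "agree", "agree"),
    ("group_b", "disagree", "agree"),
    ("group_b", "agree", "agree"),
]


def agreement(rows):
    """Share of records where the model response matches the human response."""
    rows = list(rows)
    return sum(human == model for _, human, model in rows) / len(rows)


def disaggregated_agreement(rows):
    """Agreement computed separately for each subgroup."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[0]].append(row)
    return {group: agreement(group_rows) for group, group_rows in by_group.items()}


overall = agreement(records)
per_group = disaggregated_agreement(records)

print(f"overall agreement: {overall:.3f}")
for group, score in sorted(per_group.items()):
    print(f"{group}: {score:.3f}")
```

In this toy setup the aggregate score looks high because the larger subgroup dominates it, while the smaller subgroup is matched only half of the time. Reporting the aggregate alone would hide that gap, which is the kind of masking that a disaggregated evaluation is meant to surface.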
