Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul
TL;DR
This paper develops a unified framework that combines transformer-based language embeddings with novel econometric identification to study peer effects in language within low-security correctional facilities. By bridging high-dimensional text representations with endogenous-network IV methods and nonparametric sieve estimation, the authors show that LLM language embeddings predict three-year recidivism about 30% more accurately than pre-entry covariates alone, and they find significant language-based peer effects. The approach accommodates sparse networks and multidimensional latent homophily, yielding $\sqrt{N}$-consistent, asymptotically normal estimates and providing insight into the mechanisms by which peers influence language use. The findings underscore the predictive value of language data for downstream outcomes and illustrate how AI-derived text features can complement traditional covariates in policy-relevant settings, while highlighting the need for causal identification through randomized or natural experiments.
Abstract
Studying peer effects in language is critical because language often reflects behavioral and personality traits that are important determinants of economic outcomes. However, language is unstructured, non-numeric, and high-dimensional. We combine Large Language Model (LLM) embeddings with structural econometric identification to provide a unified framework for identifying peer effects in language. We apply this framework to 80,000-120,000 written exchanges among residents of low-security correctional facilities. The LLM language profiles predict three-year recidivism 30% more accurately than pre-entry covariates alone, showing that text representations capture meaningful signals. We analyze peer effects on multidimensional language embeddings while addressing network endogeneity. We develop novel instrumental variable estimators for peer effects that accommodate multivariate outcomes, sparse networks, and multidimensional latent variables. Our methods achieve root-N consistency and asymptotic normality under realistic sparsity conditions, relaxing the dense-network assumption. Results reveal significant peer effects in residents' language profiles.
