Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives

Sam Relins; Daniel Birks; Charlie Lloyd

TL;DR

This study examines how well instruction-tuned LLMs (IT-LLMs) can identify vulnerability indicators in unstructured police narratives, comparing their outputs to those of human coders across four categories and testing multiple prompts and model sizes. Using a Boston narrative dataset, it demonstrates that IT-LLMs can screen out narratives lacking vulnerabilities with high precision, while positive and inconclusive classifications are less reliable and benefit from human review. The authors report low demographic biases after correction and show that smaller models with carefully crafted Custom prompts can rival larger models with Codebook prompts, offering practical avenues for secure, scalable qualitative coding. Overall, IT-LLMs can augment traditional qualitative methods, enabling scalable analysis of large free-text datasets while preserving transparency and replicability, though they should not replace expert judgment in ambiguous cases or in individual-level decision-making.

Abstract

Objectives: Compare qualitative coding by instruction-tuned large language models (IT-LLMs) against that of human coders in classifying the presence or absence of vulnerability in routinely collected unstructured text describing police-public interactions, and evaluate potential bias in IT-LLM codings.

Methods: Analyzing publicly available text narratives of police-public interactions recorded by the Boston Police Department, we provide humans and IT-LLMs with qualitative labelling codebooks and compare the labels generated by both, seeking to identify situations associated with (i) mental ill health; (ii) substance misuse; (iii) alcohol dependence; and (iv) homelessness. We explore multiple prompting strategies and model sizes, as well as the variability of labels generated by repeated prompts. To explore model bias, we use counterfactual methods to assess the impact of two protected characteristics, race and gender, on IT-LLM classification.

Results: IT-LLMs can effectively support human qualitative coding of police incident narratives. While there is some disagreement between LLM-generated and human-generated labels, IT-LLMs are highly effective at screening out narratives in which no vulnerabilities are present, potentially vastly reducing the requirement for human coding. Counterfactual analyses demonstrate that manipulating the gender and race of individuals described in narratives has very limited effects on IT-LLM classifications beyond those expected by chance.

Conclusions: IT-LLMs offer an effective means of augmenting human qualitative coding, requiring far fewer resources to analyze large unstructured datasets. Moreover, they encourage specificity in qualitative coding, promote transparency, and provide the opportunity for more standardized, replicable approaches to analyzing large free-text police data sources.
