Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives

Sam Relins, Daniel Birks, Charlie Lloyd

TL;DR

This study addresses how instruction-tuned LLMs can identify vulnerability indicators in unstructured police narratives by comparing outputs to human coders across four categories and testing multiple prompts and model sizes. Using a Boston narrative dataset, it demonstrates that IT-LLMs can effectively screen out narratives lacking vulnerabilities with high precision, while positive and inconclusive classifications are less reliable and benefit from human review. The authors report low demographic biases after correction and show that smaller models with carefully crafted Custom prompts can rival larger models with Codebook prompts, offering practical avenues for secure, scalable qualitative coding. Overall, IT-LLMs can augment traditional qualitative methods, enabling scalable analysis of large free-text datasets while preserving transparency and replicability, though they should not replace expert judgment for ambiguous cases or decision-making at the individual level.

Abstract

Objectives: Compare the qualitative coding of instruction-tuned large language models (IT-LLMs) against that of human coders in classifying the presence or absence of vulnerability in routinely collected unstructured text that describes police-public interactions. Evaluate potential bias in IT-LLM codings.

Methods: Analyzing publicly available text narratives of police-public interactions recorded by the Boston Police Department, we provide humans and IT-LLMs with qualitative labelling codebooks and compare the labels generated by both, seeking to identify situations associated with (i) mental ill health; (ii) substance misuse; (iii) alcohol dependence; and (iv) homelessness. We explore multiple prompting strategies and model sizes, and the variability of labels generated by repeated prompts. Additionally, to explore model bias, we use counterfactual methods to assess the impact of two protected characteristics, race and gender, on IT-LLM classification.

Results: IT-LLMs can effectively support human qualitative coding of police incident narratives. While there is some disagreement between LLM-generated and human-generated labels, IT-LLMs are highly effective at screening out narratives where no vulnerabilities are present, potentially vastly reducing the requirement for human coding. Counterfactual analyses demonstrate that manipulating the gender and race of individuals described in narratives has very limited effects on IT-LLM classifications beyond those expected by chance.

Conclusions: IT-LLMs offer an effective means of augmenting human qualitative coding, requiring far fewer resources to analyze large unstructured datasets. Moreover, they encourage specificity in qualitative coding, promote transparency, and provide the opportunity for more standardized, replicable approaches to analyzing large free-text police data sources.

Paper Structure

This paper contains 42 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Mean squared error (MSE) between human and LLM consensus labels across different model sizes (8B, 70B, and 1T+) and prompt methods (Custom and Codebook) for four vulnerability types. Solid lines with circles represent Custom prompts, while dashed lines with crosses represent Codebook prompts.
  • Figure 2: Precision, recall, and F1 scores for positive + inconclusive labels across different model sizes (8B, 70B, and 1T+) and prompt methods (Custom and Codebook). Solid lines with circles represent Custom prompts, while dashed lines with crosses represent Codebook prompts.
  • Figure 3: Confusion matrices comparing human labels (rows) with LLM consensus labels (columns) across different labelling configurations and vulnerability types. Cell values and shading intensity indicate the number of examples assigned each label combination. Darker shading indicates higher frequencies, with diagonal elements representing agreement between human and LLM labels. Results show strong alignment on negative classifications across all configurations, with most disagreements occurring at the boundaries between negative-inconclusive and inconclusive-positive categorizations.
  • Figure 4: Mean entropy of LLM labels across different labelling configurations, with 95% confidence intervals derived from bootstrap resampling. The top panel shows overall entropy, while bottom panels show entropy stratified by consensus label type (Positive, Inconclusive, Negative). Within each panel, bars represent overall entropy (blue), entropy for examples where LLM and human labels agree (green), and entropy for examples where they disagree (red). Lower entropy values indicate more consistent labelling across repeated classifications, with clear patterns showing lower entropy when LLM and human labels agree.
  • Figure 5: Relationship between model consensus and alignment with human labels across different labelling configurations and vulnerabilities. Stacked bars show the proportion of examples receiving 6-10 matching votes (x-axis) for each label type, with grey representing negative labels, tan representing inconclusive labels, and green representing positive labels. Line plots show the alignment between LLM and human labels at each consensus level for negative (black), inconclusive (orange), and positive (green) classifications. Higher consensus levels generally correspond to better alignment with human labels, particularly for negative classifications.
  • ...and 1 more figure
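
The quantities named in the captions above (consensus labels, label entropy over repeated classifications, and precision, recall, and F1 for positive + inconclusive labels) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the label strings, the use of base-2 entropy, and the plurality-vote rule (including how ties fall out) are all assumptions made for illustration.

```python
import math
from collections import Counter


def consensus(votes):
    """Return the most frequent label among repeated LLM runs and its vote count.

    Assumption: simple plurality vote; a tie resolves to whichever tied label
    appears first in `votes`. The paper's tie handling is not stated here.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count


def label_entropy(votes):
    """Shannon entropy of the label distribution across repeated runs.

    Assumption: base-2 logarithm (bits). 0.0 means every run agreed.
    """
    n = len(votes)
    return sum((c / n) * math.log2(n / c) for c in Counter(votes).values())


def flag_metrics(human, llm):
    """Precision, recall, and F1, treating positive + inconclusive as 'flagged'."""
    def flagged(label):
        return label != "negative"

    pairs = list(zip(human, llm))
    tp = sum(flagged(h) and flagged(m) for h, m in pairs)
    fp = sum(not flagged(h) and flagged(m) for h, m in pairs)
    fn = sum(flagged(h) and not flagged(m) for h, m in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: ten repeated labels for one narrative.
votes = ["negative"] * 7 + ["inconclusive"] * 3
print(consensus(votes))
print(round(label_entropy(votes), 3))

# Hypothetical human labels and LLM consensus labels for four narratives.
human = ["negative", "positive", "inconclusive", "negative"]
llm = ["negative", "positive", "negative", "positive"]
print(flag_metrics(human, llm))
```

Grouping positive and inconclusive labels into one flagged class reflects the screening use case described in the abstract: narratives the model labels negative are set aside, and everything else goes to human review.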