As an AI Language Model, "Yes I Would Recommend Calling the Police": Norm Inconsistency in LLM Decision-Making

Shomik Jain; D Calacci; Ashia Wilson

TL;DR

This work investigates norm inconsistency in large language models when making normative decisions about police intervention in Ring Neighbors surveillance videos. Using three state-of-the-art models (GPT-4, Gemini 1.0, Claude 3 Sonnet) on 928 real videos, the study demonstrates misalignment between factual crime content and the decision to call the police, and reveals biases linked to neighborhood demographics. Through regression analyses and analysis of response types and framing, the authors show substantial cross-model disagreement and model-specific patterns in how activity types and context influence normative judgments. The findings highlight the instability and opacity of normative decisions in high-stakes surveillance contexts, raise questions about bias mitigation, and argue for transparent, measurement-driven evaluation of normative behavior in foundation models. The work has practical implications for safety, policy, and the design of responsible AI systems in surveillance domains.

Abstract

We investigate the phenomenon of norm inconsistency, in which LLMs apply different norms in similar situations. Specifically, we focus on the high-risk application of deciding whether to call the police in Amazon Ring home surveillance videos. We evaluate the decisions of three state-of-the-art LLMs -- GPT-4, Gemini 1.0, and Claude 3 Sonnet -- in relation to the activities portrayed in the videos, the subjects' skin tone and gender, and the characteristics of the neighborhoods where the videos were recorded. Our analysis reveals significant norm inconsistencies: (1) a discordance between the recommendation to call the police and the actual presence of criminal activity, and (2) biases influenced by the racial demographics of the neighborhoods. These results highlight the arbitrariness of model decisions in the surveillance context and the limitations of current bias detection and mitigation strategies in normative decision-making.
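The first inconsistency named above, a gap between the recommendation to call the police and the annotated presence of crime, can be illustrated with a simple tabulation. The following is a minimal sketch, not the authors' analysis code: the record layout, the model names `model_a` and `model_b`, and the `flag_rates` helper are all hypothetical, and the toy records are made up for illustration rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the paper.
# Each record: (model name, annotators saw a crime?, model said "Yes" to calling police?)
records = [
    ("model_a", False, True),
    ("model_a", False, False),
    ("model_a", True, True),
    ("model_a", True, False),
    ("model_b", False, False),
    ("model_b", False, False),
    ("model_b", True, True),
    ("model_b", True, True),
]


def flag_rates(rows):
    """Return {(model, crime_present): share of videos flagged for police}."""
    counts = defaultdict(lambda: [0, 0])  # [flagged, total] per group
    for model, crime_present, flagged in rows:
        counts[(model, crime_present)][0] += int(flagged)
        counts[(model, crime_present)][1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}


rates = flag_rates(records)
for (model, crime_present), rate in sorted(rates.items()):
    print(f"{model} crime_present={crime_present}: flag rate {rate:.2f}")
```

Under this framing, a model whose flag rate barely changes between videos with and without annotated crime would be showing the kind of discordance the abstract describes. The paper itself reports regression analyses, which this tabulation does not attempt to reproduce.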

Paper Structure (30 sections, 3 figures, 11 tables)

Introduction
Background and Related Work
Measuring Bias in LLMs
Normative Decision-Making in LLMs
Risks of AI for Surveillance
Amazon Ring
Data and Methods
Amazon Ring Data Sample
Annotation Procedure
Neighborhood Characteristics
Models and Prompts
Response Types
Results
How Often and When Do LLMs Call the Police?
All models are unlikely to make factual judgements about crime, yet are far more likely to make the normative judgement to call police.
...and 15 more sections

Figures (3)

Figure 1: Example of norm inconsistency in GPT-4, where the model says no crime occurred but recommends police intervention. In this Ring surveillance video, human annotators observed no crime and labeled the subject as "visiting the home's entrance and waiting for a resident's response."
Figure 2: Probability that LLMs flag a video for police intervention (i.e. respond "Yes" to "Should the police be called?").
Figure 3: Fitzpatrick Scale
