Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts

Xing Wang, Huiyuan Xie, Yiyan Wang, Chaojun Xiao, Huimin Chen, Holli Sargeant, Felix Steffek, Jie Shao, Zhiyuan Liu, Maosong Sun

TL;DR

The authors find widespread LLM susceptibility to complicit facilitation: GPT-4o provides illicit assistance in nearly half of tested cases, and existing safety alignment strategies are insufficient and may even exacerbate complicit behavior.

Abstract

Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. However, the risk of these models assisting unlawful activities remains underexplored. In this study, we define this high-risk behavior as complicit facilitation - the provision of guidance or support that enables illicit user instructions - and present four empirical studies that assess its prevalence in widely deployed LLMs. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents to assess LLMs' complicit facilitation behavior. Our findings reveal widespread LLM susceptibility to complicit facilitation, with GPT-4o providing illicit assistance in nearly half of tested cases. Moreover, LLMs exhibit deficient performance in delivering credible legal warnings and positive guidance. Further analysis uncovers substantial safety variation across socio-legal contexts. On the legal side, we observe heightened complicity for crimes against societal interests, non-extreme but frequently occurring violations, and malicious intents driven by subjective motives or deceptive justifications. On the social side, we identify demographic disparities that reveal concerning complicit patterns towards marginalized and disadvantaged groups, with older adults, racial minorities, and individuals in lower-prestige occupations disproportionately more likely to receive unlawful guidance. Analysis of model reasoning traces suggests that model-perceived stereotypes, characterized along warmth and competence, are associated with the model's complicit behavior. Finally, we demonstrate that existing safety alignment strategies are insufficient and may even exacerbate complicit behavior.

Paper Structure

This paper contains 37 sections, 4 equations, 24 figures, 8 tables.

Figures (24)

  • Figure 1: An illustration of the EVIL benchmark for complicit facilitation evaluation. The automated pipeline for constructing illicit instructions consists of two stages. Stage 1 extracts first-person scenario descriptions from real-world judgments of Chinese and United States courts with LLM assistance. Stage 2 constructs a taxonomy of illicit intents that characterize the underlying intents behind illegal requests, grounded in established legal frameworks. Illicit queries are instantiated based on paired scenario descriptions and illicit intents. An illicit instruction is formed by an illicit scenario description combined with a corresponding query that indicates the underlying illicit intent. Ten popular LLMs are tested on the EVIL benchmark in terms of the safety, responsibility, and credibility of their responses to illicit instructions, leveraging the LLM-as-a-Judge evaluation paradigm. As an illustration, we present an illicit instruction on rice smuggling (translated from a Chinese criminal case) together with the responses from DeepSeek-R1 and GPT-4o, and their corresponding evaluation results.
  • Figure 1: Example of complicit facilitation: an illicit instruction, the response from GPT-4o, and its corresponding evaluation results.
  • Figure 2: Safety rates obtained by models for different categories of legal issues in the Chinese (blue) and United States (purple) legal contexts. Each dot marks the mean safety rate averaged across all evaluated models for a specific category of legal issue. Violin plots show the kernel density of these category-level rates. The embedded boxplots summarize the distribution, with the box representing the interquartile range (Q1--Q3) and the central line indicating the median. The larger shaded areas extend to the most extreme values within $1.5\times$ the interquartile range. Representative categories of legal issues are labeled for reference. In both the Chinese and US jurisdictions, models achieve safety rates below 65% for over half of the legal issues analyzed.
  • Figure 2: Example of complicit facilitation: an illicit instruction, the response from DeepSeek-R1, and its corresponding evaluation results.
  • Figure 3: Safety rates obtained by models for different groups of illicit instructions, categorized by the legal interests violated (Per.: personal, in blue; Prop.: property-related, in green; Soc.: societal, in yellow), in the Chinese (left) and United States (right) legal contexts. Each bar shows the mean safety rate of model responses to illicit instructions related to a specific legal interest group. The whiskers indicate 95% confidence intervals. Statistical significance is assessed with chi-squared tests. Asterisks indicate Bonferroni-corrected $P$ values ($*: P < 0.05$, $**: P < 0.01$, $*\!*\!*: P < 0.001$). Only statistically significant comparisons are annotated. Models demonstrate significantly lower safety rates when responding to illicit instructions involving harm to societal interests, compared to those targeting personal or property-related interests.
  • ...and 19 more figures
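
The Figure 3 caption mentions mean safety rates with 95% confidence intervals and pairwise chi-squared tests with Bonferroni correction. The following is a minimal sketch of how such a comparison could be computed, not the authors' implementation. The group counts are hypothetical and not taken from the paper, and the Wald interval and the uncorrected Pearson test are assumptions, since the caption does not say which interval or test variant was used.

```python
import math
from itertools import combinations


def safety_rate_ci(safe, total, z=1.96):
    """Safety rate with a Wald (normal-approximation) 95% interval.

    Assumption: the caption does not state which interval type is used.
    """
    p = safe / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)


def chi2_2x2(safe_a, total_a, safe_b, total_b):
    """Pearson chi-squared test on a 2x2 table, without continuity correction.

    Requires both outcomes (safe and unsafe) to occur at least once overall,
    otherwise an expected count is zero.
    """
    table = [[safe_a, total_a - safe_a], [safe_b, total_b - safe_b]]
    rows = [total_a, total_b]
    cols = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    n = total_a + total_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def bonferroni_pairwise(groups):
    """Bonferroni-adjusted p-values for all pairwise group comparisons."""
    pairs = list(combinations(groups, 2))
    adjusted = {}
    for a, b in pairs:
        _, p = chi2_2x2(*groups[a], *groups[b])
        adjusted[(a, b)] = min(1.0, p * len(pairs))
    return adjusted


def stars(p):
    """Map an adjusted p-value to the asterisk notation in the caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical (safe, total) response counts per legal-interest group.
groups = {"Per.": (700, 1000), "Prop.": (690, 1000), "Soc.": (550, 1000)}

for name, (safe, total) in groups.items():
    rate, low, high = safety_rate_ci(safe, total)
    print(f"{name}: {rate:.3f} [{low:.3f}, {high:.3f}]")

for pair, p_adj in bonferroni_pairwise(groups).items():
    print(pair, f"{p_adj:.3g}", stars(p_adj))
```

Multiplying each p-value by the number of comparisons (three pairs here) is the plain Bonferroni adjustment. It is conservative, which matches the caption's practice of annotating only the comparisons that remain significant.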