Table of Contents
Fetching ...

A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios

Kimberly T. Mai, Anna Gausen, Magda Dubois, Mona Murad, Bessie O'Dell, Nadine Staes-Polet, Christopher Summerfield, Andrew Strait

TL;DR

This work develops a reproducible, expert-grounded framework for tracking how these risks may evolve with time as models grow more capable and adversaries adapt, and suggests that current risks from text-generation models are relatively minimal.

Abstract

AI is increasingly being used to assist fraud and cybercrime. However, it is unclear whether current large language models can assist complex criminal activity. Working with law enforcement and policy experts, we developed multi-turn evaluations for three fraud and cybercrime scenarios (romance scams, CEO impersonation, and identity theft). Our evaluations focused on text-to-text model capabilities. In each scenario, we measured model capabilities in ways designed to resemble real-world misuse, such as breaking down requests for fraud into a sequence of seemingly benign queries, and measuring whether models provide actionable information, relative to a standard web search baseline. We found that (1) current large language models provide minimal practical assistance with complex criminal activity, (2) open-weight large language models fine-tuned to remove safety guardrails provided substantially more help, and (3) decomposing requests into benign-seeming queries elicited more assistance than explicitly malicious framing or system-level jailbreaks. Overall, the results suggest that current risks from text-generation models are relatively minimal. However, this work contributes a reproducible, expert-grounded framework for tracking how these risks may evolve with time as models grow more capable and adversaries adapt.

A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios

TL;DR

This work develops a reproducible, expert-grounded framework for tracking how these risks may evolve with time as models grow more capable and adversaries adapt, and suggests that current risks from text-generation models are relatively minimal.

Abstract

AI is increasingly being used to assist fraud and cybercrime. However, it is unclear whether current large language models can assist complex criminal activity. Working with law enforcement and policy experts, we developed multi-turn evaluations for three fraud and cybercrime scenarios (romance scams, CEO impersonation, and identity theft). Our evaluations focused on text-to-text model capabilities. In each scenario, we measured model capabilities in ways designed to resemble real-world misuse, such as breaking down requests for fraud into a sequence of seemingly benign queries, and measuring whether models provide actionable information, relative to a standard web search baseline. We found that (1) current large language models provide minimal practical assistance with complex criminal activity, (2) open-weight large language models fine-tuned to remove safety guardrails provided substantially more help, and (3) decomposing requests into benign-seeming queries elicited more assistance than explicitly malicious framing or system-level jailbreaks. Overall, the results suggest that current risks from text-generation models are relatively minimal. However, this work contributes a reproducible, expert-grounded framework for tracking how these risks may evolve with time as models grow more capable and adversaries adapt.
Paper Structure (25 sections, 2 equations, 8 figures, 3 tables)

This paper contains 25 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Diagram showing the LFT development process. We develop LFTs through a five-step pipeline. Each step incorporates feedback from operational fraud and cyber experts to ensure alignment with known risk scenarios: 1. Model functionality identification (analysing large language model developments to identify functionalities relevant for misuse). 2. Risk modelling (mapping the misuse lifecycle to identify where AI could assist adversaries). 3. Scenario development (creating specific misuse instances that test identified capabilities across the risk model). 4. Prompt development (developing multi-turn prompts that decompose each scenario into stages across the risk model). 5. Rubric development (creating grading criteria for actionability and information access).
  • Figure 2: Diagram showing the evaluation format. These prompts are fixed. For each prompt, we record the response from the model and evaluate them on information access and actionability, considering previous conversation.
  • Figure 3: Histogram showing the proportion of actionability (left panel) and information access scores (right panel), across all LFTs.
  • Figure 4: Model effect sizes on actionability and information access scores for the main effects of the regression. Error bars represent 94% credible intervals reflecting uncertainty in the estimated effect for each model, with positive coefficients (on the right of the dotted line) indicating models that produce higher scores compared to the average, and negative coefficients producing lower scores compared to the average.
  • Figure 5: Predicted average scores by model for actionability and information access from the Bayesian ordered logistic regression, split by decomposition method (benign versus malicious task framing). Scores were averaged across fraud types, actor types, and system jailbreaking method. Error bars show 94% credible intervals reflecting uncertainty in the estimated effect for each model. Hatched bars (///) denote uncensored models.
  • ...and 3 more figures