Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar; Ayush K Tarun; Murari Mandal; Maksym Andriushchenko; Antoine Bosselut

TL;DR

STING (Sequential Testing of Illicit N-step Goal execution) is an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. It provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings.

Abstract

LLM-based agents execute real-world workflows via tools and memory. These same affordances also let adversaries use such agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools such as discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
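The turn-level control flow implied by this description, and by the component roles named in the Figure 1 caption, can be sketched as an evaluation harness. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Plan` and `EpisodeResult` types, the four callables standing in for the Attacker, Target, Refusal Detector, and Phase-Completion Checker, and the `max_turns_per_phase` budget are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Plan:
    """Output of a (hypothetical) strategist: a persona plus ordered phases."""
    persona: str
    phases: List[str]


@dataclass
class EpisodeResult:
    jailbreak: bool
    phases_completed: int
    turns_used: int
    transcript: List[Tuple[str, str]] = field(default_factory=list)


def run_episode(
    plan: Plan,
    attacker: Callable[[str, str, str], str],          # (persona, phase, feedback) -> message
    target: Callable[[str], str],                      # message -> reply
    is_refusal: Callable[[str], Tuple[bool, str]],     # reply -> (refused, feedback)
    phase_done: Callable[[str, str], Tuple[bool, str]],  # (phase, reply) -> (done, feedback)
    max_turns_per_phase: int = 3,                      # assumed budget, not from the paper
) -> EpisodeResult:
    transcript: List[Tuple[str, str]] = []
    turns = 0
    completed = 0
    for phase in plan.phases:
        feedback = ""   # evaluator feedback carried into the next attacker turn
        done = False
        for _ in range(max_turns_per_phase):
            message = attacker(plan.persona, phase, feedback)
            reply = target(message)
            transcript.append((message, reply))
            turns += 1
            refused, why = is_refusal(reply)
            if refused:
                feedback = why          # retry the same phase with refusal feedback
                continue
            done, feedback = phase_done(phase, reply)
            if done:
                break
        if not done:
            # Phase budget exhausted: the episode ends without a jailbreak.
            return EpisodeResult(False, completed, turns, transcript)
        completed += 1
    # A jailbreak is declared only when every phase has been completed.
    return EpisodeResult(completed > 0, completed, turns, transcript)
```

Passing the evaluators as plain callables keeps the harness independent of any particular model API; in practice each would wrap an LLM call, and the refusal check runs before the phase-completion check, as in the Figure 1 caption.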

Paper Structure (53 sections, 14 equations, 4 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Chat-based safety benchmarks and frameworks
Agentic safety and misuse evaluation
Multilingual Safety
Our Work.
Method
Strategist.
Attacker.
Refusal Detector.
Phase-Completion Checker.
Analysis Framework
Agentic misuse evaluation as a multi-cost bounded reachability objective
Time-to-first-jailbreak: Measuring attack efficiency
Restricted Mean Jailbreak Discovery (RMJD).
...and 38 more sections

Figures (4)

Figure 1: STING: (a) A Strategist constructs a deceptive persona and decomposes the harmful intent into executable phases. (b) The Attacker embodies the persona and attempts each phase against the Target agent. After each target response, the (c) Refusal Detector checks for refusal; if none is detected, the (d) Phase-Completion Checker assesses whether the phase objective has been met. Both evaluators provide actionable feedback to guide the Attacker’s next turn. A jailbreak is declared once all phases are successfully completed.
Figure 2: Kaplan-Meier discovery curves (95% CI) showing the fraction of harmful behaviours for which at least one strategy succeeds (jailbreak) for a given strategy budget; RMJD summarizes each curve (higher = earlier/more jailbreak successes).
Figure 3: AgentHarm Score (%) comparison between single-turn prompting and STING across 7 languages for 3 models. Differences in misuse outcomes are less pronounced than those reported in prior chatbot-focused jailbreak studies (Yong et al., 2023).
Figure 4: AgentHarm Score (AHS) for Qwen3-Next and GPT-5.1 under varying reasoning settings across languages. No-thinking settings are consistently less safe. For GPT-5.1, medium reasoning is safer than high reasoning.
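
The Kaplan-Meier discovery curves of Figure 2 can be illustrated with a small sketch. It treats each harmful behaviour as one observation: `times[i]` is the strategy index at which the first jailbreak occurred, or the exhausted budget if none did, and `events[i]` records whether a jailbreak was observed. The exact definition of RMJD is not given in this excerpt, so `restricted_area` below is only an assumed analogue (the area under the discovery curve up to a budget `tau`, in the spirit of restricted mean survival time), not the paper's formula. All function and variable names are hypothetical.

```python
from typing import List, Tuple


def km_discovery_curve(times: List[int], events: List[bool]) -> List[Tuple[int, float]]:
    """Return [(t, F(t))] at each distinct event time, where F = 1 - KM survival."""
    order = sorted(zip(times, events))
    n_at_risk = len(order)
    surv = 1.0
    curve: List[Tuple[int, float]] = []
    i = 0
    while i < len(order):
        t = order[i][0]
        d = 0  # jailbreaks first observed at budget t
        c = 0  # all observations leaving the risk set at t (events + censored)
        while i < len(order) and order[i][0] == t:
            d += 1 if order[i][1] else 0
            c += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / n_at_risk   # Kaplan-Meier product-limit step
            curve.append((t, 1.0 - surv))
        n_at_risk -= c
    return curve


def restricted_area(curve: List[Tuple[int, float]], tau: float) -> float:
    """Area under the step-function discovery curve on [0, tau] (assumed RMJD analogue)."""
    area = 0.0
    prev_t, prev_f = 0.0, 0.0
    for t, f in curve:
        if t >= tau:
            break
        area += prev_f * (t - prev_t)
        prev_t, prev_f = t, f
    area += prev_f * (tau - prev_t)
    return area


# Toy data (made up for illustration): four behaviours, strategy budget of 4.
times = [1, 2, 2, 4]
events = [True, True, False, False]
curve = km_discovery_curve(times, events)
score = restricted_area(curve, tau=4) / 4   # normalised to [0, 1]
```

Under this assumed definition, a curve that rises earlier accumulates more area, which matches the caption's reading that a higher RMJD means earlier or more frequent jailbreak discovery.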
