"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, Yun Shen, Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He

TL;DR

GTA (Game-Theory Attack) reframes black-box jailbreak testing as a finite-horizon, early-stoppable sequential stochastic game and uses a quantal-response view to reinterpret LLM outputs under an objective-shaping template. By instantiating a Mechanism-Induced Graded Prisoner's Dilemma and adding an adaptive Attacker Agent, GTA achieves high jailbreak success across diverse models and datasets, and scales to multiple game templates (e.g., Dollar Auction, Keynesian Beauty Contest) with automated background template generation. The framework also integrates a Harmful-Words Detection Agent to probe defenses, revealing persistent misalignment in real-world deployments and offering a practical path for systematic safety evaluation and defense design. Overall, GTA demonstrates strong effectiveness, efficiency, and scalability for black-box red-teaming, with implications for safety assessment and defense strategies in commercial and open-source LLM applications.

Abstract

As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. In contrast to prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the original safety preference may become maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the attack success rate (ASR). Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as DeepSeek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports longitudinal safety monitoring of popular HuggingFace LLMs.
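
The quantal-response view mentioned in the abstract can be illustrated with the standard logit form, in which a player chooses each action with probability proportional to exp(λ · payoff). The sketch below is a generic illustration and not the paper's exact formulation: the function name `quantal_response`, the `rationality` parameter (λ), and the payoff values are assumptions introduced here.

```python
import math


def quantal_response(payoffs, rationality):
    """Logit quantal response over a dict of {action: payoff}.

    Returns {action: probability}, where each probability is proportional
    to exp(rationality * payoff). `rationality` (lambda) is a hypothetical
    parameter name chosen for this sketch.
    """
    # Subtract the largest scaled payoff before exponentiating so that
    # large values do not overflow; the resulting probabilities are unchanged.
    shift = max(rationality * u for u in payoffs.values())
    weights = {a: math.exp(rationality * u - shift) for a, u in payoffs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical Prisoner's Dilemma style payoffs, for illustration only.
    example_payoffs = {"cooperate": 1.0, "defect": 2.0}
    for lam in (0.0, 1.0, 5.0):
        print(lam, quantal_response(example_payoffs, lam))
```

At `rationality = 0` the choice is uniform over actions; as it grows, probability mass concentrates on the highest-payoff action, which is the noisy best-response behavior that quantal response is meant to capture.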

Paper Structure

This paper contains 17 sections, 15 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Overview of our jailbreak framework $\mathsf{GTA}$. (a) Mechanism-induced, graded Prisoner's Dilemma (PD) scenario construction. (b) Core Interaction: apply game-theoretic templates to the target LLM. $\mathsf{GTA}$ uses these templates to induce a "template-over-safety" flip; the PD instantiation illustrates our modeling of black-box jailbreak as a finite-horizon, early-stoppable interaction. (c) Harmful-Words Detection Agent: inserts lightweight word-level perturbations to reduce detection by prompt-guard models. (d) More Game-Theoretic Scenarios: Dollar Auction and Keynesian Beauty Contest. (e) Automated Generation of Diverse Background Scenarios: using one-shot prompting to generate template variants that share the same model but differ in roles, context, and narrative style. (f) Example dialogue under the PD template.
  • Figure 2: Average EQS$\downarrow$ (mean of $ASR_1$ and $ASR_2$) for jailbreak attacks on AdvBench and StrongREJECT.
  • Figure 3: Distribution of the round in which each model is successfully jailbroken under the $ASR_1$ evaluation.
  • Figure 4: Llama-3.1's output under the CipherChat-ASCII attack is unreliable, likely because the model cannot properly understand or produce this type of cipher-style text.
  • Figure 5: Attack performance (defined as the mean of $ASR_1$ and $ASR_2$) of $\mathsf{GTA}$ against GPT4o-mini on AdvBench-subset under different generation parameters.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Conjecture 4.1: Template-over-safety flip