Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei

TL;DR

CL-GSO expands the jailbreak strategy space through a component-based framework guided by the Elaboration Likelihood Model and searches that space with a genetic algorithm. It shows that the safety boundaries of LLMs can be breached with high probability under black-box attacks, and that the resulting attacks transfer across models. The proposed intention-consistency evaluation provides a robust fitness signal beyond binary judgments, enabling efficient search. These findings underscore the need for stronger, more comprehensive defenses against diverse attack strategies in safety-aligned LLMs.

Abstract

Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, and they overlook a more fundamental problem: their effectiveness is inherently bounded by predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities from expanding the strategy space: we achieve a success rate of over 90% on Claude-3.5, where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.

Paper Structure

This paper contains 18 sections, 5 equations, 20 figures, 5 tables, 1 algorithm.

Figures (20)

  • Figure 1: Comparison of Our Strategy Space with Existing Methods. By decomposing jailbreak strategies into essential components (Role, Content Support, Context, and Communication Skills) and allowing their elements to be added and recombined, our design creates a unified and more diverse strategy space. Traditional methods like PAP and GPTFuzzer, which treat strategies as fixed, indivisible units, are only special cases sampled from our expanded strategy pool.
  • Figure 2: Overview of the Component-Level Genetic-based Strategy Optimization (CL-GSO) Framework. (Left) The component-level strategy space design decomposes strategies based on the Elaboration Likelihood Model's central route (Role, Content Support, Context) and peripheral route (Communication Skills), with these complementary dimensions enabling flexible combinations for diverse strategies. (Right) The genetic-based strategy optimization process involves initializing a population of strategies, evaluating their fitness, selecting better individuals, and applying crossover and mutation operations to generate more effective strategies across generations.
  • Figure 3: Comparison of CL-GSO's Jailbreak Success Rate (JSR) and Average Queries (Avg.Q) with Other Methods against SOTA Safety-Aligned LLMs.
  • Figure 4: Cross-model Transferability of CL-GSO. The plots on the left and right, respectively, depict the transferability evaluated on AdvBench and CLAS.
  • Figure 5: Comparison of Evaluation Methods. Our Intention Consistency Scoring clearly outperforms the other methods, reaching an accuracy of 96.5%.
  • ...and 15 more figures