Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Yiting Dong, Guobin Shen, Dongcheng Zhao, Xiang He, Yi Zeng

TL;DR

A novel scalable jailbreak attack is introduced that preempts the activation of an LLM's safety policies by occupying its computational resources: a preliminary task saturates the model's processing capacity before the subsequent instruction is processed.

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. Existing attack methods are fixed or specifically tailored to certain models and cannot flexibly adjust attack strength, which is critical for generalization when attacking models of various sizes. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources. Our method engages the LLM in a resource-intensive preliminary task (a Character Map lookup and decoding process) before presenting the target instruction. By saturating the model's processing capacity, we prevent the activation of safety protocols when the subsequent instruction is processed. Extensive experiments on state-of-the-art LLMs demonstrate that our method achieves a high success rate in bypassing safety measures without requiring gradient access or manual prompt engineering. We verify that our approach offers a scalable attack that quantifies attack strength and adapts to different model scales at the optimal strength. We show that the safety policies of LLMs may be more susceptible to resource constraints. Our findings reveal a critical vulnerability in current LLM safety designs, highlighting the need for more robust defense strategies that account for resource-intensive conditions.

Paper Structure (23 sections, 5 equations, 15 figures, 6 tables, 1 algorithm)

Figures (15)

  • Figure 1: Load Tasks Flowchart. Flowchart of the load tasks used to occupy the model's resources. The complexity of the character map can be increased through different approaches.
  • Figure 2: Workflow of Attack Method. The workflow of our attack method using the Character Map lookup task. A Character Map is generated and combined into the prompt template. The LLM performs the task and the instruction until the attack succeeds.
  • Figure 3: An example of a random Character Map (CM) selected from $\Sigma$.
  • Figure 4: Harmful Categories (HC). Attack Success Rate for different harmful categories. The curves show results under the GCG and Llama judges. Llama3-8B and Qwen2.5-7B are used in the experiments.
  • Figure 5: Scalability of ASR. Attack Success Rate as a function of Character Map Size, Query Length, and Query Count. (a,c) Each curve represents a different Query Length. (b,d) Each curve represents a different Query Count.
  • ...and 10 more figures