SQL Injection Jailbreak: A Structural Disaster of Large Language Models

Jiawei Zhao, Kejiang Chen, Weiming Zhang, Nenghai Yu

TL;DR

The paper identifies a new vulnerability in LLMs, SQL Injection Jailbreak (SIJ), which exploits the way input prompts are structured to coerce models into harmful outputs. It formalizes a multi-component attack pipeline (Pattern Control, Affirmative Prefix generation, triggers, and anomaly handling) and demonstrates high attack success rates across multiple open-source and closed-source models on the AdvBench and HEx-PHI datasets. The authors compare SIJ to prior jailbreak methods, showing superior construction efficiency and comparable or higher harmful scores, and they offer a simple adaptive defense, Self-Reminder-Key, that reduces attack impact on several models. This work highlights a new class of prompt-structure vulnerabilities and underscores the need for defenses that address not just model internals but also how inputs are composed and structured.

Abstract

Large Language Models (LLMs) are susceptible to jailbreak attacks that can induce them to generate harmful content. Previous jailbreak methods primarily exploited the internal properties or capabilities of LLMs, such as optimization-based jailbreak methods and methods that leveraged the model's context-learning abilities. In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically, the way LLMs construct input prompts. By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content. For open-source models, SIJ achieves near 100% attack success rates on five well-known LLMs on the AdvBench and HEx-PHI datasets, while incurring lower time costs compared to previous methods. For closed-source models, SIJ achieves an average attack success rate over 85% across five models in the GPT and Doubao series. Additionally, SIJ exposes a new vulnerability in LLMs that urgently requires mitigation. To address this, we propose a simple adaptive defense method called Self-Reminder-Key to counter SIJ and demonstrate its effectiveness through experimental results. Our code is available at https://github.com/weiyezhimeng/SQL-Injection-Jailbreak.

Paper Structure

This paper contains 37 sections, 13 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: The left side illustrates a SQL injection attack, while the right side presents an example of an SIJ attack, with annotations indicating the various components of the LLM's input and output.
  • Figure 2: Flowchart of SQL Injection Jailbreak, using Vicuna as an example.
  • Figure 3: Radar chart of harmful scores for different categories of harmful prompts across different models.
  • Figure 4: Radar chart of harmful scores for different categories of harmful prompts across different models after aggregation.
  • Figure 5: The relationship between $d$, Harmful Score, and ASR.