Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings

Yue Huang; Jingyu Tang; Dongping Chen; Bingda Tang; Yao Wan; Lichao Sun; Philip S. Yu; Xiangliang Zhang

TL;DR

This work tackles the problem of jailbreaking large language models under realistic, non-white-box conditions by exploiting fragile alignment on out-of-distribution (OOD) data. It introduces ObscurePrompt, a training-free pipeline that first constructs a seed prompt from known jailbreak techniques and then applies obscurity-driven transformations via a powerful model (GPT-4) to produce a diverse attack set $S_p$, which is iteratively deployed against target LLMs. In experiments on AdvBench across seven models, ObscurePrompt achieves higher attack effectiveness than the baselines and remains effective against common defenses such as paraphrasing; the experiments also indicate that larger models tend to be more vulnerable to obscured prompts. The results underscore the need for robustness against OOD and obscured inputs and can inform future defenses that harden LLM safety against such attack vectors, with practical implications for model governance and security.

Abstract

Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly in addressing "jailbreaking" attacks on aligned LLMs. Previous research predominantly relies on scenarios involving white-box LLMs or specific, fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method called ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects an LLM's ethical decision boundary. ObscurePrompt starts by constructing a base prompt that integrates well-known jailbreaking techniques. Powerful LLMs are then used to obscure the original prompt through iterative transformations, aiming to bolster the attack's robustness. Comprehensive experiments show that our approach substantially improves upon previous methods in terms of attack effectiveness and maintains its efficacy against two prevalent defense mechanisms.

Paper Structure (19 sections, 7 equations, 12 figures, 4 tables, 1 algorithm)

INTRODUCTION
RELATED WORK
Trustworthy Large Language Models
Jailbreak Attack & Defense of LLMs
MOTIVATION
Preliminary
Observation
ObscurePrompt
Prompt Seed Curation
Obscure-Guided Transformation
Attack Integration
EXPERIMENTS
Experiment Settings
Main Results
Semantic Shifting Evaluation
...and 4 more sections

Figures (12)

Figure 1: Jailbreaking with original queries and with obscure input. After constructing the base prompt, we transform the prompt to be more obscure.
Figure 2: Principal Component Analysis (PCA) visualization of the top layer embeddings of Llama2-7b, differentiated between harmful and harmless queries as well as obscure and original queries.
Figure 3: Examples of different attack types.
Figure 4: The pipeline of ObscurePrompt. Utilizing harmful queries, we initially employ various jailbreak prompt techniques to create a seed prompt. This seed prompt is then transformed by powerful LLMs (i.e., GPT-4) into a more obscure version. By repeating this process $n$ times, we generate $n$ refined prompts. These prompts are subsequently utilized to attack targeted LLMs.
Figure 5: ASR in different integrated prompt numbers.
...and 7 more figures
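
Figure 4 describes generating $n$ obscured prompts per query, and Figure 5 reports ASR against the number of integrated prompts. A minimal sketch of how such a metric could be tabulated is given below. It assumes that ASR means attack success rate and that a query counts as successful if at least one of the first $n$ prompts succeeds; that any-of-$n$ rule, the function name `attack_success_rate`, and the sample outcomes are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[Sequence[bool]], n_prompts: int) -> float:
    """Fraction of queries for which any of the first n_prompts attempts succeeded.

    outcomes[i][j] is True if prompt variant j was judged successful on query i.
    The any-of-n aggregation is an assumption made for this sketch.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one query")
    if n_prompts < 1:
        raise ValueError("n_prompts must be at least 1")
    # A query is a hit if at least one of its first n_prompts variants succeeded.
    hits = sum(any(row[:n_prompts]) for row in outcomes)
    return hits / len(outcomes)


# Hypothetical judged outcomes: rows are queries, columns are prompt variants.
outcomes = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]

# ASR as a function of how many prompt variants are integrated.
asr_by_n = {n: attack_success_rate(outcomes, n) for n in (1, 2, 3)}
print(asr_by_n)
```

Under this aggregation the metric cannot decrease as $n$ grows, which is one reason to report it per value of $n$ rather than as a single number.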
