
Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Kailong Wang, Yuekang Li, Ling Shi, Stjepan Picek

TL;DR

This paper investigates jailbreak vulnerabilities in large language models that arise from continuous input embeddings rather than discrete suffix prompts. It demonstrates that direct attacks on the input are feasible in a white-box setting but suffer from randomness and overfitting. To address this, the authors propose CLIP, a projection-based method that bounds the inputs using the mean of the vocabulary embeddings and their per-dimension variance. Empirical results on LLaMa7b and Vicuna7b show that CLIP improves attack robustness and stability across input lengths, with the attack success rate (ASR) rising from $62\%$ to $83\%$ for an input length of 40 at iteration 1000. The work highlights exploitable subspaces in high-dimensional embeddings and suggests that constraining input representations is a practical defense strategy.

Abstract

Security concerns for large language models (LLMs) have recently escalated, focusing on thwarting jailbreaking attempts in discrete prompts. However, the exploration of jailbreak vulnerabilities arising from continuous embeddings has been limited, as prior approaches primarily involved appending discrete or continuous suffixes to inputs. Our study presents a novel channel for conducting direct attacks on LLM inputs, eliminating the need for suffix addition or specific questions provided that the desired output is predefined. We additionally observe that extensive iterations often lead to overfitting, characterized by repetition in the output. To counteract this, we propose a simple yet effective strategy named CLIP. Our experiments show that for an input length of 40 at iteration 1000, applying CLIP improves the ASR from 62% to 83%.

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: This graph illustrates three types of attacks: the GCG attack (Zou et al., 2023), the continuous suffix attack (Schwinn et al., 2023), and the attack on the input, which is the focus of this study.
  • Figure 2: Illustration of the CLIP method. The light gray box represents the input update phase from the start $X_{0}$ to $X_{t}$. Each input $X$ generates an output $Y$ when processed by the model $M$. $X_{0}$ is initialized with a specified input type and length using the vocabulary matrix $V$. The model generates $Y_{0}$, and the loss is calculated against the target $\tilde{Y}$. This loss is then used to apply gradient descent to $X_{0}$, producing $X'_{0}$, which CLIP then processes to generate $X_{1}$.
  • Figure 3: The graph shows the randomness pattern. It is also the result of applying representation engineering to LLaMa7b with 2048 features and $\beta$ set to 0.5, which halves the weight of the contrast vector in the computation.
  • Figure 4: The user specifies the target response, and the target LLM generates a repeated answer.
  • Figure 5: The graph demonstrates the distinct separation of labels in the final layer of LLaMa7b using contrast vectors.
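
The update described in the Figure 2 caption (a gradient step on the input embeddings followed by the CLIP projection) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the summary above only says that CLIP bounds inputs using the vocabulary mean and per-dimension variance, so the box `mean ± alpha * std`, the multiplier `alpha`, and the function names `vocab_bounds`, `clip_embeddings`, and `attack_step` are all assumptions made here. The gradient is passed in as plain numbers so the sketch stays independent of any model.

```python
import statistics


def vocab_bounds(vocab, alpha=1.0):
    """Per-dimension (low, high) bounds derived from a vocabulary matrix V.

    vocab: list of embedding rows, each a list of floats of equal length.
    alpha: hypothetical width multiplier (an assumption, not from the paper).
    """
    dims = list(zip(*vocab))  # transpose: one tuple of values per dimension
    means = [statistics.fmean(d) for d in dims]
    # Assumption: the population standard deviation is used as the scale;
    # the paper may use the variance directly.
    stds = [statistics.pstdev(d) for d in dims]
    low = [m - alpha * s for m, s in zip(means, stds)]
    high = [m + alpha * s for m, s in zip(means, stds)]
    return low, high


def clip_embeddings(x, low, high):
    """Project every row of x into the per-dimension box [low, high]."""
    return [
        [min(max(v, lo), hi) for v, lo, hi in zip(row, low, high)]
        for row in x
    ]


def attack_step(x, grad, lr, low, high):
    """One update: gradient descent on X_t, then the clip projection."""
    stepped = [
        [v - lr * g for v, g in zip(row, grad_row)]
        for row, grad_row in zip(x, grad)
    ]
    return clip_embeddings(stepped, low, high)


# Toy usage: a 2-token vocabulary with 2-dimensional embeddings.
toy_vocab = [[0.0, 0.0], [2.0, 4.0]]
low, high = vocab_bounds(toy_vocab)
x_next = attack_step([[5.0, -3.0]], [[1.0, 1.0]], lr=1.0, low=low, high=high)
print(x_next)  # [[2.0, 0.0]]
```

In a real white-box attack the gradient would come from backpropagating the loss between the model output and the target $\tilde{Y}$ through the model $M$; the projection step itself would be unchanged.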