Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation

Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang

TL;DR

LLMs remain vulnerable to jailbreaks. Token-level methods automate the attack but scale poorly and struggle against evolving defenses. The authors introduce JailMine, a logit-based token manipulation framework that mines affirmative responses and iteratively biases logits to elicit harmful outputs, complemented by a sorting model for stable results. Across five open-source LLMs and two benchmarks, JailMine achieves high attack success rates (about 95% on average) while taking significantly less time than baselines, highlighting practical risks. The work underscores the need for stronger defenses, including broader denial-pattern strategies and robust evaluation, to improve LLM safety and reliability.

Abstract

Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated "mining" process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful language models.

Paper Structure

This paper contains 39 sections, 6 equations, 6 figures, 10 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overall Workflow of JailMine
  • Figure: Logits Manipulation