Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs

Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, Xiangyu Zhang

TL;DR

This paper identifies coercive interrogation as a distinct, logits-accessible threat to LLM alignment that does not rely on crafted prompts. It introduces Lint, a non-prompt-based framework that coerces next-token choices to reveal hidden toxic content using entailment-based ranking and precise intervention-point identification. Empirical results show high attack success and quality of leaked content across open-source and commercial LLMs, outperforming state-of-the-art jail-breaking methods in both speed and effectiveness. The work underscores significant safety concerns for open models and soft-label APIs, and suggests defense directions such as data cleansing or unlearning to mitigate toxic knowledge leakage.

Abstract

Large Language Models (LLMs) are now widely used in various applications, making it crucial to align their ethical standards with human values. However, recent jail-breaking methods demonstrate that this alignment can be undermined using carefully constructed prompts. In our study, we reveal a new threat to LLM alignment when a bad actor has access to the model's output logits, a common feature in both open-source LLMs and many commercial LLM APIs (e.g., certain GPT models). This threat does not rely on crafting specific prompts. Instead, it exploits the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits. By forcefully selecting lower-ranked output tokens during the auto-regressive generation process at a few critical output positions, we can compel the model to reveal these hidden responses. We term this process model interrogation. This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster. The harmful content uncovered through our method is more relevant, complete, and clear. Additionally, it can complement jail-breaking strategies, and combining the two further boosts attack performance. Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
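
The core mechanism described above, selecting a lower-ranked token at a few chosen output positions during otherwise greedy auto-regressive decoding, can be illustrated with a toy sketch. This is not the authors' Lint implementation: the `toy_logits` table, the `decode` helper, and the `forced_ranks` parameter are hypothetical names invented for illustration, the "model" is a hand-written lookup table rather than an LLM, and the entailment-based ranking and intervention-point identification mentioned in the TL;DR are not modeled.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for an LLM: maps the tokens generated so far
# to a dict of next-token logits. Not a real model or API.
LogitFn = Callable[[Sequence[str]], Dict[str, float]]


def toy_logits(prefix: Sequence[str]) -> Dict[str, float]:
    """Hand-written next-token logits (illustration only)."""
    table = {
        (): {"Sorry": 3.0, "Sure": 1.0, "<eos>": -1.0},
        ("Sorry",): {"<eos>": 2.0, "Sure": 0.0},
        ("Sure",): {"here": 2.5, "<eos>": 0.5},
        ("Sure", "here"): {"<eos>": 2.0, "here": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 0.0})


def decode(logit_fn: LogitFn, forced_ranks: Dict[int, int], max_len: int = 8) -> List[str]:
    """Greedy decoding, except at positions listed in forced_ranks,
    where the token with the given rank (0 = highest logit) is chosen."""
    out: List[str] = []
    for pos in range(max_len):
        logits = logit_fn(out)
        # Rank candidate tokens from highest to lowest logit.
        ranked = sorted(logits, key=logits.get, reverse=True)
        # Clamp the requested rank to the number of available candidates.
        rank = min(forced_ranks.get(pos, 0), len(ranked) - 1)
        token = ranked[rank]
        if token == "<eos>":
            break
        out.append(token)
    return out


print(decode(toy_logits, {}))      # plain greedy decoding
print(decode(toy_logits, {0: 1}))  # second-ranked token forced at position 0
```

In this toy table, a single forced choice at position 0 moves decoding onto a different continuation, after which ordinary greedy decoding resumes. How the positions and candidate tokens are chosen in practice is the subject of the paper itself and is not captured here.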

Paper Structure (25 sections, 6 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Auto-regression in LLM
  • Figure 2: "Dr. AI" jail-breaking prompt
  • Figure 3: "DAN" jail-breaking prompt
  • Figure 5: Observation
  • Figure 6: Overview
  • ...and 12 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3