Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang; Zixuan Zhao; Jiaqi Huang; Jingyu Hua; Sheng Zhong

TL;DR

Facing diminishing returns of traditional jailbreak prompts against hardened LLMs, the paper focuses on subtoxic questions to reveal more nuanced vulnerabilities. It introduces the Gradual Attitude Change (GAC) model, formalizes an Evaluation Question Set (EQS), and defines a continuous attitude measure $G(i) = E[A(o)]$ with $A(o) \in [-1,1]$. It presents two observations, GAC-1 and GAC-2, that establish consistent positive-prompt effects and a ranking of prompts via $t(x)$, enabling robust cross-question comparison. The work contributes a refined framework for LLM security evaluation and points to standardized benchmarks and mechanistic investigations as future directions.
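As a rough illustration of the attitude measure $G(i) = E[A(o)]$, the expectation can be approximated by averaging attitude scores over repeated samples of the model's output for the same input. The sketch below is not the authors' implementation: the function name `estimate_attitude`, the example scores, and the convention that $-1$ means refusal and $1$ means full compliance are all assumptions made for illustration.

```python
from statistics import mean


def estimate_attitude(scores):
    """Empirical estimate of G(i) = E[A(o)] from sampled attitude scores.

    `scores` holds one A(o) value per sampled response to the same input i.
    Each value must lie in [-1, 1], matching the range stated for A(o).
    """
    if not scores:
        raise ValueError("need at least one sampled response")
    for s in scores:
        if not -1.0 <= s <= 1.0:
            raise ValueError(f"attitude score {s} is outside [-1, 1]")
    return mean(scores)


# Hypothetical scores for five sampled responses per input.
# Assumed convention (not from the paper): -1 = refusal, 1 = full compliance.
baseline_scores = [-1.0, -1.0, -0.5, -1.0, -0.5]  # subtoxic question alone
padded_scores = [0.5, 1.0, 0.0, 0.5, 1.0]         # same question with a positive prompt

g_baseline = estimate_attitude(baseline_scores)
g_padded = estimate_attitude(padded_scores)
print(f"G(baseline) = {g_baseline:.2f}, G(padded) = {g_padded:.2f}")
```

With more samples per input the sample mean becomes a tighter estimate of the expectation; five samples, as used in this toy data, give only a coarse picture.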

Abstract

As prompt jailbreaking of Large Language Models (LLMs) attracts growing attention, it is important to establish a generalized research paradigm for evaluating attack strength and a basic model for conducting subtler experiments. In this paper, we propose a novel approach that focuses on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. By designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also helps fortify LLMs against potential exploits.

Paper Structure (7 sections, 2 theorems, 12 equations, 2 figures)

Introduction
Subtoxic Questions
Superposition Property and Positive Prompt
GAC Model
The First Observation of GAC Model
The Second Observation of GAC Model
Method to Measure t(x) and Future Work

Key Result

Corollary 1

Let $x_P^n$ denote $x\in PP_q$ repeated $n$ times, $x_N^n$ accordingly. For $n>m$, it follows that:

Figures (2)

Figure 1: An example of a subtoxic question applied to ChatGPT
Figure 2: Attitude distribution of GPT-3.5's responses to the subtoxic question shown in Fig. 1 as the content and number of positive prompts padded to it vary. (Each combination was tested five times.)

Theorems & Definitions (2)

Corollary 1
Corollary 2
