Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts
Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong
TL;DR
Faced with diminishing returns from traditional jailbreak prompts against hardened LLMs, the paper focuses on subtoxic questions to reveal more nuanced vulnerabilities. It introduces the Gradual Attitude Change (GAC) model, formalizes an Evaluation Question Set (EQS), and defines a continuous attitude measure $G(i) = E[A(o)]$ with $A(o) \in [-1,1]$. It presents two observations, GAC-1 and GAC-2, that establish consistent positive-prompt effects and a ranking of prompts via $t(x)$, enabling robust cross-question comparison. The work contributes a refined framework for LLM security evaluation and points to standardized benchmarks and mechanistic investigations as future directions.
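To make the attitude measure concrete, the following is a minimal sketch of how $G(i) = E[A(o)]$ might be estimated as a sample mean over several model outputs for one input. The names `estimate_attitude` and `toy_score`, the refusal-prefix heuristic, and the sign convention (-1 for refusal, +1 for compliance) are illustrative assumptions, not the authors' scoring procedure.

```python
from statistics import mean
from typing import Callable, Sequence


def estimate_attitude(outputs: Sequence[str],
                      score: Callable[[str], float]) -> float:
    """Sample-mean estimate of G(i) = E[A(o)] over outputs sampled for one input i."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    scores = [score(o) for o in outputs]
    for s in scores:
        # A(o) is defined on [-1, 1]; reject scores outside that range.
        if not -1.0 <= s <= 1.0:
            raise ValueError(f"attitude score {s} is outside [-1, 1]")
    return mean(scores)


def toy_score(output: str) -> float:
    """Hypothetical stand-in for A(o): -1.0 for an apparent refusal, +1.0 otherwise.

    The sign convention and the prefix check are assumptions for illustration only.
    """
    text = output.lower()
    if text.startswith("i cannot") or text.startswith("i can't"):
        return -1.0
    return 1.0


# Made-up sample outputs for a single input; not data from the paper.
samples = [
    "I cannot help with that.",
    "Sure, here is an overview.",
    "I can't assist with this request.",
    "Here is some general information.",
]
g = estimate_attitude(samples, toy_score)
```

In practice `score` would be replaced by whatever graded judgment of attitude is used, and more samples per input would give a steadier estimate of the expectation.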
Abstract
As prompt jailbreaking of Large Language Models (LLMs) attracts growing attention, it is important to establish a generalized research paradigm for evaluating attack strength and a basic model for conducting subtler experiments. In this paper, we propose a novel approach that focuses on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. By designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.
