JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets

Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu

TL;DR

Jailbreak prompts threaten LLM safety, motivating a visual analytics approach to uncover private jailbreak strategies in large-scale human-LLM conversations. JailbreakHunter introduces a three-level workflow (group-level, conversation-level, and turn-level) implemented via a Filter Panel, Cluster View, Conversation View, and Comparison View. These views are underpinned by embedding-based projections (SentenceTransformers), dimensionality reduction (UMAP), kernel density estimation (KDE), and similarity measures against reported jailbreak prompts. The system is evaluated through two case studies and expert interviews, which demonstrate effective identification of jailbreak prompts, patterns, and multi-turn strategies; user feedback confirms its usability and potential impact for safety testing. The work provides a model-agnostic, scalable framework for monitoring and analyzing jailbreak prompts in real-world data, enabling researchers and practitioners to mitigate risks and improve LLM safety policies.
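
As a rough illustration of the density-estimation step named above, the following sketch evaluates a Gaussian KDE over 2D points such as those a UMAP projection of prompt embeddings would produce. The summary states only that KDE is used; the isotropic Gaussian kernel, the `bandwidth` default, and the function name `gaussian_kde_2d` are assumptions made for this example, not details taken from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def gaussian_kde_2d(points: List[Point], query: Point, bandwidth: float = 0.5) -> float:
    """Estimate the density at `query` from 2D `points`.

    Uses an isotropic Gaussian kernel (an assumed choice for this sketch).
    """
    if not points:
        return 0.0
    # Normalization for a 2D Gaussian with standard deviation `bandwidth`,
    # averaged over the number of points.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for x, y in points:
        sq_dist = (x - query[0]) ** 2 + (y - query[1]) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total


# Hypothetical projected coordinates: three nearby points and one outlier.
projected = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense = gaussian_kde_2d(projected, (0.0, 0.0))
sparse = gaussian_kde_2d(projected, (2.5, 2.5))
```

Here `dense` exceeds `sparse`, which is the kind of contrast a density overlay would use to draw attention to crowded regions of the projection.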

Abstract

Large Language Models (LLMs) have gained significant attention but have also raised concerns due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack against LLMs, have emerged and continue to evolve to breach the safety protocols of LLMs. To address this issue, LLMs are regularly updated with safety patches based on reported jailbreak prompts. However, malicious users often keep their successful jailbreak prompts private to exploit LLMs. To uncover these private jailbreak prompts, extensive analysis of large-scale conversational datasets is necessary to identify prompts that still manage to bypass the system's defenses. This task is highly challenging due to the immense volume of conversation data, the diverse characteristics of jailbreak prompts, and their presence in complex multi-turn conversations. To tackle these challenges, we introduce JailbreakHunter, a visual analytics approach for identifying jailbreak prompts in large-scale human-LLM conversational datasets. We have designed a workflow with three analysis levels: group-level, conversation-level, and turn-level. Group-level analysis enables users to grasp the distribution of conversations and identify suspicious conversations using multiple criteria, such as similarity to jailbreak prompts reported in previous research and attack success rates. Conversation-level analysis facilitates understanding of how conversations progress and helps discover jailbreak prompts within their conversation contexts. Turn-level analysis allows users to explore the semantic similarity and token overlap between a single-turn prompt and the reported jailbreak prompts, aiding in the identification of new jailbreak strategies. The effectiveness and usability of the system were verified through multiple case studies and expert interviews.
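
The two turn-level signals mentioned in the abstract, semantic similarity and token overlap, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's implementation: embeddings are passed in as plain lists of floats (in practice they would come from a sentence encoder), tokens are split on whitespace, and the helper names `cosine_similarity` and `longest_shared_tokens` are hypothetical.

```python
import math
from difflib import SequenceMatcher
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def longest_shared_tokens(query: str, reported: str) -> List[str]:
    """Longest contiguous run of tokens shared by both prompts.

    Tokenization is lowercase whitespace splitting, an assumption of this sketch.
    """
    q_tokens = query.lower().split()
    r_tokens = reported.lower().split()
    matcher = SequenceMatcher(None, q_tokens, r_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(q_tokens), 0, len(r_tokens))
    return q_tokens[match.a:match.a + match.size]


# Illustrative inputs only; not drawn from the paper's dataset.
shared = longest_shared_tokens(
    "please ignore all previous instructions now",
    "You must ignore all previous instructions",
)
```

A prompt with high embedding similarity to reported prompts but only a short shared token run may be a paraphrase or variant rather than a copy, which is the kind of case the turn-level analysis is meant to surface.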

Paper Structure (36 sections, 9 figures, 2 tables)

This paper contains 36 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: JailbreakHunter assists users in quickly identifying jailbreak prompts from large-scale human-LLM conversational datasets. (a) The Filter Panel supports users in setting up an initial filter to extract conversations with malicious content. (b) The Cluster View enables users to explore the distribution of conversations and reported jailbreak prompts and narrow down to a specific group of conversations. (c) The Conversation View helps users understand the progression and potential malicious content of the conversations. (d) The Comparison View allows users to inspect the similarity between currently inspected queries and reported jailbreak prompts.
  • Figure 2: JailbreakHunter consists of three modules: the dataset storage module, the computation module, and the visual analytics module.
  • Figure 3: Design choices for the tile encoding ASR (a, b) and the left part of the horizontal glyph representing one conversation (c, d). (a, c) Our current design. (b, d) Alternative design.
  • Figure 4: E1 selected a filter to check the English flagged conversations with the GPT4 model (a). E1 examined the region with high ASR and discovered conversations that shared similar prefixes (b). In the second turn, E1 found that a user request was flagged as malicious. From the sixth turn onwards, the model responses were also flagged as malicious, indicating a successful jailbreak (c). E1 compared it with reported jailbreak prompts and identified its distinction from them (d).
  • Figure 5: E5 selected a filter to check potential multi-turn jailbreak prompts (a). E5 examined the region with the keyword "horny" in the Cluster View (b). E5 identified similarities between the query and reported jailbreak prompts, despite the absence of long overlapping parts (c). Furthermore, E5 discovered the utilization of a repetition strategy (d) and forcing instructions (e) in the multi-turn jailbreak approach, enabling jailbreak success even when the model refuses to respond in the previous round.
  • ...and 4 more figures