Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak

Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao

TL;DR

This paper introduces G-Guard, an attention-aware Graph Neural Network (GNN)-based input classifier designed to defend against multi-turn jailbreak attacks on LLMs, together with an attention-aware augmentation mechanism that retrieves the most relevant single-turn query for the ongoing multi-turn conversation.

Abstract

Large Language Models (LLMs) have gained significant traction in various applications, yet their capabilities present risks for both constructive and malicious exploitation. Despite extensive training and fine-tuning efforts aimed at enhancing safety, LLMs remain susceptible to jailbreak attacks. Recently, the emergence of multi-turn attacks has intensified this vulnerability. Unlike single-turn attacks, multi-turn attacks incrementally escalate dialogue complexity, rendering them more challenging to detect and mitigate. In this study, we introduce G-Guard, an innovative attention-aware Graph Neural Network (GNN)-based input classifier specifically designed to defend against multi-turn jailbreak attacks targeting LLMs. G-Guard constructs an entity graph for multi-turn queries, which captures the interrelationships between the queries and the harmful keywords that are present in them. Furthermore, we propose an attention-aware augmentation mechanism that retrieves the most relevant single-turn query based on the ongoing multi-turn conversation. The retrieved query is incorporated as a labeled node within the graph, thereby enhancing the GNN's capacity to classify the current query as harmful or benign. Evaluation results show that G-Guard consistently outperforms all baselines across diverse datasets and evaluation metrics, demonstrating its efficacy as a robust defense mechanism against multi-turn jailbreak attacks.

Paper Structure

This paper contains 20 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Example from ChatGPT-4o. Illustration of multi-turn jailbreak attacks. Previous defense methods based on LM-based classifiers detect explicit harmful intent within a single query but fail to capture implicit intent, leading to misclassification of harmful dialogues as benign. In contrast, G-Guard constructs a global graph that integrates entity and semantic relationships across turns, allowing a GNN to reason over conversational context and accurately identify evolving harmful intent.
  • Figure 2: G-Guard architecture. A query is parsed into a graph, augmented with labeled nodes via attention-based retrieval, merged into a global graph, and filtered into a subgraph for GNN-based classification.
  • Figure 3: Subgraph Selection. For each incoming query, G-Guard selects a local subgraph from the global graph based on attention scores.
  • Figure 4: Single-turn Attack Performance. Accuracy and F1-score of G-Guard and baselines under single-turn jailbreak attacks. G-Guard performs competitively across all datasets despite being designed for multi-turn scenarios.
  • Figure 5: Detailed Performance. G-Guard’s performance under different multi-turn attack settings: (a) longer single multi-turn attacks; (b) multiple simultaneous multi-turn attacks.
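
The pipeline summarized in the Figure 2 caption (parse queries into a graph, then attach a retrieved labeled query as an extra node) can be illustrated with a toy sketch. This is not the authors' implementation: the keyword list, the labeled queries, and the function names below are hypothetical, and plain bag-of-words cosine similarity stands in for the attention-based retrieval described in the paper. The GNN classifier itself is omitted.

```python
from collections import Counter
from math import sqrt

# Hypothetical data for illustration only; not taken from the paper.
HARMFUL_KEYWORDS = {"explosive", "detonator"}
LABELED_QUERIES = [
    ("how do I build an explosive device", "harmful"),
    ("how do I bake sourdough bread", "benign"),
]


def tokens(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def cosine(a, b):
    """Bag-of-words cosine similarity; a stand-in for attention scores."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def add_edge(graph, u, v):
    """Insert an undirected edge into an adjacency-set graph."""
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def build_graph(turns):
    """Link each query node to the previous turn and to keywords it contains."""
    graph = {}
    for i, turn in enumerate(turns):
        node = ("q", i)
        graph.setdefault(node, set())
        if i > 0:
            add_edge(graph, ("q", i - 1), node)
        for word in set(tokens(turn)) & HARMFUL_KEYWORDS:
            add_edge(graph, node, ("kw", word))
    return graph


def augment(graph, turns):
    """Attach the most similar labeled single-turn query to the latest turn.

    Assumes `turns` is non-empty and `graph` came from build_graph(turns).
    """
    conversation = " ".join(turns)
    text, label = max(LABELED_QUERIES, key=lambda item: cosine(conversation, item[0]))
    labeled_node = ("labeled", text, label)
    add_edge(graph, ("q", len(turns) - 1), labeled_node)
    return labeled_node
```

In this sketch a downstream classifier would see the labeled node as a neighbor of the current query, which is the intuition behind the augmentation step; how G-Guard actually scores, merges, and filters nodes is specified in the paper, not here.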