KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Zhaoxia Yin

TL;DR

This work tackles jailbreak vulnerabilities in large language models under black-box conditions by introducing KG-DF, a knowledge-graph-based defense that does not require access to model internals. KG-DF uses an extensible semantic parsing module to extract secure concepts from prompts, retrieves relevant safety and general knowledge triples via cosine similarity, and reconstructs prompts with security warnings to steer the model toward safer generation. Key contributions include a dual-module KG (Security and General Knowledge), a keyword-extraction step driven by GPT-3.5-turbo, an embedding pipeline built on Qwen3-Embedding-8B, and a prompt-reconstruction mechanism that improves both safety and general QA performance. Empirically, KG-DF achieves near-zero attack success rates on open-source models (e.g., Vicuna-7B) and closed-source models (GPT-3.5, GPT-4) while maintaining high generality (86–89%), and ablation studies validate the importance of keyword extraction and pre-output judgment. The approach offers a practical, scalable defense in black-box settings and a pathway to integrating evolving domain knowledge to sustain safety without sacrificing usability.
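The retrieval and prompt-reconstruction steps summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the triples, the `embed` function (a toy bag-of-words stand-in for an embedding model such as Qwen3-Embedding-8B), the `threshold` and `top_k` parameters, and the wording of the warning are all hypothetical choices made for illustration. The keywords are assumed to have already been extracted from the prompt by a separate step.

```python
import math
from collections import Counter

# Hypothetical toy knowledge graph: (head, relation, tail, module) entries.
# "module" marks whether a triple belongs to the security or general part.
TRIPLES = [
    ("explosive", "is_regulated_by", "public safety law", "security"),
    ("phishing", "is_a_form_of", "fraud", "security"),
    ("photosynthesis", "occurs_in", "chloroplast", "general"),
]


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(keywords, top_k=2, threshold=0.1):
    """Return up to top_k triples whose similarity to the keywords passes the threshold."""
    query = embed(" ".join(keywords))
    scored = []
    for triple in TRIPLES:
        head, relation, tail, _module = triple
        score = cosine(query, embed(f"{head} {relation} {tail}"))
        if score >= threshold:
            scored.append((score, triple))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triple for _score, triple in scored[:top_k]]


def reconstruct_prompt(prompt, triples):
    """Prepend retrieved knowledge, plus a warning if any security triple matched."""
    if not triples:
        return prompt
    facts = "; ".join(
        f"{head} {relation.replace('_', ' ')} {tail}"
        for head, relation, tail, _module in triples
    )
    warning = ""
    if any(triple[3] == "security" for triple in triples):
        warning = (
            "Warning: this request touches safety-sensitive topics. "
            "Refuse if it seeks harmful content.\n"
        )
    return f"{warning}Relevant knowledge: {facts}\nUser prompt: {prompt}"


matched = retrieve(["phishing", "fraud"])
print(reconstruct_prompt("How do I spot phishing?", matched))
```

Because the defense only rewrites the input text, a sketch like this needs no access to the protected model's weights or logits, which is what makes the setting black-box.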

Abstract

With the widespread application of large language models (LLMs) in various fields, the security challenges they face have become increasingly prominent, especially jailbreak attacks. These attacks use crafted inputs to induce the model to generate erroneous or uncontrolled outputs, threatening both the generality and the security of the model. Although existing defense methods have shown some effectiveness, they often struggle to balance model generality and security: excessive defense may limit normal use of the model, while insufficient defense may leave security vulnerabilities. In response to this problem, we propose a Knowledge Graph Defense Framework (KG-DF). Specifically, because of its structured knowledge representation and semantic association capabilities, a knowledge graph (KG) can be searched to associate input content with safe knowledge in the knowledge base, thus identifying potentially harmful intentions and providing safe reasoning paths. However, traditional KG methods encounter significant challenges in keyword extraction, particularly when confronted with diverse and evolving attack strategies. To address this issue, we introduce an extensible semantic parsing module, whose core task is to transform the input query into a set of structured and secure concept representations, thereby enhancing the relevance of the matching process. Experimental results show that our framework enhances defense performance against various jailbreak attack methods, while also improving the response quality of the LLM in general QA scenarios by incorporating domain-general knowledge.

Paper Structure

This paper contains 32 sections, 3 equations, 16 figures, and 10 tables.

Figures (16)

  • Figure 1: The pipeline of our proposed defense framework against jailbreak attacks based on Knowledge Graphs at the inference stage. When our warning information is attached to the input prompts, the protected LLM will be robust to malicious attacks while maintaining reasonable responses to legitimate requests.
  • Figure 2: Framework of the Defense Method. The proposed framework comprises three main steps: (1) constructing a knowledge graph that integrates both safety-related and general-domain knowledge, (2) extracting keywords from user prompts, and (3) retrieving and integrating relevant knowledge to guide the model toward safer and more accurate responses.
  • Figure 3: Category distribution for Child_Abuse, Animal_Abuse, Economic_Harm and Fraud.
  • Figure 4: Category distribution for Arts and Entertainment, Business and Economics, Computer Science and Technology, Daily Life Knowledge.
  • Figure 5: Prompt templates for generating natural statements for each category.
  • ...and 11 more figures