Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models

Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bing Feng, Yuan Yao, Yu Li

TL;DR

This work tackles the challenge of safeguarding large language models against jailbreak attacks, especially in domain-specific contexts, without sacrificing general utility. It introduces Guide for Defense (G4D), a dynamic, inference-stage, multi-agent framework composed of an intention detector, a question paraphraser, and a safety analyzer, all augmented by external retrieval to generate analytically grounded safety guidance. By integrating these components into the input fed to the victim LLM as $(P_{sys}\oplus Q^*\oplus I_{aug}\oplus G)$, G4D achieves strong robustness across domain-specific and general jailbreak benchmarks while preserving performance on benign tasks, even when using lighter or different agent LLMs. The results demonstrate that a modular, retrieval-augmented, multi-agent defense can significantly reduce attack success rates with minimal degradation to helpfulness, offering practical implications for deploying safer LLMs in diverse domains. The framework also shows compatibility with existing output-stage defenses, enabling complementary strategies for enhanced safety in real-world systems.
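The composition $(P_{sys}\oplus Q^*\oplus I_{aug}\oplus G)$ can be illustrated with a minimal Python sketch. It rests on several assumptions that the summary above does not confirm: that $P_{sys}$ is the system prompt, $Q^*$ the paraphrased question, $I_{aug}$ the retrieval-augmented intention, and $G$ the safety guidance; that $\oplus$ is plain text concatenation; and that the agents run in the order shown. The function name, the `Agent` type, and every parameter are hypothetical, and the agents are passed in as ordinary callables rather than real LLM calls.

```python
from typing import Callable

# Hypothetical agent signature (an assumption): each agent maps text to text.
Agent = Callable[[str], str]


def build_guarded_input(
    query: str,
    system_prompt: str,
    detect_intent: Agent,
    retrieve: Agent,
    paraphrase: Agent,
    analyze_safety: Agent,
    sep: str = "\n\n",
) -> str:
    """Assemble P_sys + Q* + I_aug + G into one prompt string.

    The data flow between agents is assumed for illustration only.
    """
    # Intention detector: summarize what the user is asking for.
    intent = detect_intent(query)
    # External retrieval: fetch background knowledge for that intention.
    evidence = retrieve(intent)
    # I_aug: the intention together with whatever was retrieved.
    augmented_intent = f"{intent}\n{evidence}" if evidence else intent
    # Question paraphraser: restate the query to produce Q*.
    paraphrased = paraphrase(query)
    # Safety analyzer: produce guidance G from the augmented intention.
    guidance = analyze_safety(augmented_intent)
    # Concatenate in the order given by the formula, skipping empty parts.
    parts = [system_prompt, paraphrased, augmented_intent, guidance]
    return sep.join(part.strip() for part in parts if part.strip())
```

Passing the agents as callables keeps the sketch independent of any particular model or retrieval backend, which mirrors the claim that the framework still works with lighter or different agent LLMs.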

Abstract

With the extensive deployment of Large Language Models (LLMs), ensuring their safety has become increasingly critical. However, existing defense methods often struggle with two key issues: (i) inadequate defense capabilities, particularly in domain-specific scenarios like chemistry, where a lack of specialized knowledge can lead to the generation of harmful responses to malicious queries; and (ii) over-defensiveness, which compromises the general utility and responsiveness of LLMs. To mitigate these issues, we introduce a multi-agent defense framework, Guide for Defense (G4D), which leverages accurate external information to provide an unbiased summary of user intentions and analytically grounded safety response guidance. Extensive experiments on popular jailbreak attacks and benign datasets show that G4D enhances LLMs' robustness against jailbreak attacks in both general and domain-specific scenarios without compromising the models' general functionality.

Paper Structure

This paper contains 53 sections, 1 equation, 19 figures, 14 tables.

Figures (19)

  • Figure 1: Performance comparison of different defense methods on two language models. Our G4D achieves a low attack success rate (ASR%) while maintaining high LLM functionality (Benign Score). The Y-axis represents defense performance, with higher ASR indicating greater vulnerability, while the X-axis reflects capability on normal prompts. Robust defense is defined by the average ASR among all methods, and the benign score of the vanilla model on normal benchmarks indicates an over-defense boundary.
  • Figure 2: Inadequate defense. GPT-4o understands the properties of Bacillus anthracis, yet it provides instructions on culturing it. In contrast, G4D refuses to answer questions regarding its cultivation.
  • Figure 3: Over-defensiveness. When asked how to synthesize CO2 in a lab setting, Claude-3.5-Sonnet withholds useful information, while G4D provides accurate and faithful guidance.
  • Figure 4: Pipeline of the G4D framework, which integrates three agents: an intention detector, a question paraphraser, and a safety analyzer. This multi-agent defense assists LLMs in generating responses informed by query context and safety considerations, boosting faithfulness and minimizing potential risks across various domains.
  • Figure 5: Prompt for Intention Detector.
  • ...and 14 more figures