Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models

Jingyuan Yang, Bowen Yan, Rongjun Li, Ziyu Zhou, Xin Chen, Zhiyong Feng, Wei Peng

TL;DR

Unsafe prompts threaten LLM safety. Prior few-shot gradient-based detectors rely on directional (cosine) gradient similarity, which can miss patterns that show up only in unsigned gradients. GradCoo introduces gradient co-occurrence analysis: it constructs safe and unsafe gradient references from a few reference prompts and aggregates per-component similarity scores to detect unsafe prompts, mitigating directional bias. It achieves state-of-the-art AUPRC on ToxicChat and XSTest and generalizes across base models of different origins and sizes. Compared with fine-tuned guardrail models, it needs less data and compute while remaining robust and adaptable. Future work includes multimodal content, theoretical analysis, and explainability.

Abstract

Unsafe prompts pose significant safety risks to large language models (LLMs). Existing methods for detecting unsafe prompts rely on data-driven fine-tuning to train guardrail models, which requires significant data and computational resources. In contrast, recent few-shot gradient-based methods have emerged that require only a few safe and unsafe reference prompts. Such a gradient-based approach identifies unsafe prompts by analyzing consistent patterns in the gradients of safety-critical parameters in LLMs. Although effective, its restriction to directional similarity (cosine similarity) introduces "directional bias", limiting its ability to identify unsafe prompts. To overcome this limitation, we introduce GradCoo, a novel gradient co-occurrence analysis method that expands the scope of safety-critical parameter identification to include unsigned gradient similarity, thereby reducing the impact of "directional bias" and improving the accuracy of unsafe prompt detection. Comprehensive experiments on the widely used benchmark datasets ToxicChat and XSTest demonstrate that our proposed method achieves state-of-the-art (SOTA) performance compared to existing methods. Moreover, we confirm the generalizability of GradCoo in detecting unsafe prompts across a range of LLM base models of various sizes and origins.
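
The "directional bias" mentioned in the abstract can be illustrated with a toy example. The sketch below is not the GradCoo algorithm: the vectors, the function names, and the use of absolute-value cosine as a stand-in for "unsigned gradient similarity" are all illustrative assumptions. It only shows how two gradients that touch the same parameters with opposite signs look maximally dissimilar under cosine similarity, yet identical once sign is ignored.

```python
import math

def cosine(u, v):
    """Signed (directional) cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def unsigned_cosine(u, v):
    """Cosine similarity of element-wise absolute values (sign ignored).

    Assumption: a simple stand-in for an unsigned similarity measure,
    not the score defined in the paper.
    """
    return cosine([abs(a) for a in u], [abs(b) for b in v])

# Hypothetical toy gradients over four parameters. Both concentrate on
# the same parameters (indices 0 and 2) but point in opposite directions.
g_ref = [0.9, -0.1, 0.8, 0.0]    # gradient from a reference prompt
g_in = [-0.9, 0.1, -0.8, 0.0]    # gradient from the prompt to classify

directional = cosine(g_ref, g_in)        # close to -1: looks dissimilar
unsigned = unsigned_cosine(g_ref, g_in)  # close to +1: same parameters active

print(f"directional similarity: {directional:.3f}")
print(f"unsigned similarity:    {unsigned:.3f}")
```

Under a purely directional measure the input would be pushed away from the reference, even though both gradients activate the same parameters. Ignoring sign recovers that shared pattern, which is the kind of signal the abstract says GradCoo is designed to capture.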

Paper Structure

This paper contains 19 sections, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: A scenario showing how directional bias affects the grouping of gradients. The gradients of a safe reference prompt (A), an unsafe reference prompt (B), and the input prompt to be classified (C) are shown as yellow, red, and light green triangles, respectively, in an LLM's representation space. (a) Directional bias causes the gradients of prompts A and C to be grouped incorrectly; (b) eliminating directional bias produces the desired grouping of the gradients of prompts A and C.
  • Figure 2: Flowchart of the proposed Gradient Co-occurrence method, which has two main steps. (1) The first step extracts the safe and unsafe parameter gradients by computing gradients from the safe/unsafe reference prompts and removing the corresponding directional and magnitude biases. (2) The second step aggregates the gradients' co-occurrence scores to determine the safety of the input prompt.
  • Figure 3: Performance Variation on the XSTest Dataset with Varying Numbers of Safe/Unsafe Reference Pairs.
  • Figure 4: The effects of our method across different base models. The test dataset is XSTest.
  • Figure 5: The performance of our method on models of different size scales. The selected models are from the Qwen-2.5-Instruct series, ranging from 0.5B to 14B, and the test dataset is XSTest.