Steering Over-refusals Towards Safety in Retrieval Augmented Generation

Utsav Maskey, Mark Dras, Usman Naseem

TL;DR

The paper tackles the problem of over-refusal in retrieval-augmented generation (RAG), where safety-aligned models decline benign requests due to contaminated retrieved content. It introduces RagRefuse, a domain-stratified benchmark spanning six domains and multiple contamination patterns, to quantify how query intent, context contamination, domain priors, and harmful-text density drive refusals. To mitigate this, it proposes SafeRAG-Steering, a model-centric, zero-shot embedding-edit approach that steers intermediate representations toward safe output regions at inference time without retraining. Experiments on Llama-3.1-8B-Instruct and Qwen1.5-7B-Instruct show substantial reductions in over-refusal while preserving legitimate refusals, indicating practical applicability for production RAG systems.

Abstract

Safety alignment in large language models (LLMs) induces over-refusals, where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
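
The abstract does not spell out the steering rule, so the following is only a minimal sketch of the general idea of moving an embedding toward a known safe, non-refusing region. It assumes, purely for illustration, that the safe region is summarized by the centroid of embeddings collected from confirmed non-refusing runs and that steering is a linear interpolation toward that centroid. The names `centroid`, `steer`, and `alpha` are hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Mean of a non-empty list of equal-length vectors (assumed 'safe' embeddings)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steer(hidden: Vector, safe_centroid: Vector, alpha: float = 0.5) -> Vector:
    """Move `hidden` a fraction `alpha` of the way toward `safe_centroid`.

    alpha = 0 leaves the embedding unchanged; alpha = 1 replaces it with the centroid.
    This interpolation rule is an assumption, not the paper's stated method.
    """
    if len(hidden) != len(safe_centroid):
        raise ValueError("dimension mismatch")
    return [h + alpha * (c - h) for h, c in zip(hidden, safe_centroid)]


# Toy example with 2-dimensional embeddings.
safe_examples = [[0.0, 0.0], [2.0, 4.0]]
target = centroid(safe_examples)
steered = steer([3.0, 0.0], target, alpha=0.5)
```

In a real model the vectors would be intermediate hidden states at a chosen layer, and the choice of layer and of `alpha` would need to be tuned so that legitimate refusals are preserved.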

Paper Structure

This paper contains 21 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Over-refusal in RAG. Over-refusal occurs when benign queries are refused. Refusal behavior depends on the user query's intent and on context contamination. Abbreviations: B (Benign), H (Harmful).
  • Figure 2: Over-refusal rate (refusal of benign queries, which the models should not refuse) on text domains and context-contamination combinations (a) and by frequency of harmful contexts (c). Similarly, the refusal rate on contamination combinations (b) and text domains (d) compares how domains and contamination affect refusals.
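
The captions above define the over-refusal rate as refusals of benign queries. As a small illustration of that definition, the sketch below computes it from labelled records. The `Record` type, its field names, and the toy data are hypothetical and are not taken from the paper or the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    query_label: str  # "B" (benign) or "H" (harmful), following the Figure 1 abbreviations
    refused: bool     # whether the model refused to answer


def refusal_rate(records: List[Record], label: str) -> float:
    """Fraction of queries with the given label that were refused."""
    subset = [r for r in records if r.query_label == label]
    if not subset:
        raise ValueError(f"no records with label {label!r}")
    return sum(r.refused for r in subset) / len(subset)


def over_refusal_rate(records: List[Record]) -> float:
    """Refusal rate restricted to benign queries."""
    return refusal_rate(records, "B")


# Toy data: four benign queries (one refused) and two harmful queries (both refused).
records = [
    Record("B", False), Record("B", True), Record("B", False), Record("B", False),
    Record("H", True), Record("H", True),
]
orr = over_refusal_rate(records)
harmful_rr = refusal_rate(records, "H")
```

Reporting the benign and harmful refusal rates side by side makes it visible whether a mitigation lowers over-refusal without also lowering legitimate refusals.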