Mitigating Exaggerated Safety in Large Language Models

Ruchira Ray; Ruchi Bhalani

TL;DR

The paper tackles exaggerated safety in large language models, where safe prompts are inappropriately refused. By coupling XSTest prompts with interactive, contextual, and few-shot prompting, it maps model decision boundaries and mitigates refusals across Llama2, Gemma, Command R+, and Phi-3. The results show a substantial reduction in misclassifications, from about 25% at baseline to around 1.8% post-prompting, with gains ranging from 90% to 97% depending on the model and strategy. The work demonstrates that carefully designed prompting strategies can preserve helpfulness while maintaining safety, offering practical guidance for deploying safer LLMs and informing future safety tooling. The findings have significant implications for improving user experience and safety compliance in real-world AI systems.

Abstract

As the popularity of Large Language Models (LLMs) grows, combining model safety with utility becomes increasingly important. The challenge is making sure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce excessive safety behaviors, in which 26.1% of safe prompts were found to be misclassified as dangerous and refused, we use a combination of XSTest dataset prompts as well as interactive, contextual, and few-shot prompting to examine the decision boundaries of LLMs such as Llama2, Gemma, Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best for Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate exaggerated safety behaviors by an overall 92.9% across all LLMs. Our work presents multiple prompting strategies to jailbreak LLMs' decision-making processes, allowing them to navigate the fine line between refusing unsafe prompts and remaining helpful.
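The two headline figures in the abstract can be related with a short calculation. The sketch below assumes that the 92.9% figure is a relative reduction in refused safe prompts, which is an interpretation and not something the abstract states outright; the function name `residual_rate` is hypothetical. Under that assumption, the remaining refusal rate lands close to the "around 1.8%" quoted in the TL;DR.

```python
def residual_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Refusal rate left after removing a relative share of the baseline refusals.

    Both arguments are fractions in [0, 1]. Treating the reported mitigation
    as a relative reduction is an assumption made for this illustration.
    """
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be a fraction in [0, 1]")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be a fraction in [0, 1]")
    return baseline_rate * (1.0 - relative_reduction)


baseline = 0.261   # share of safe prompts refused at baseline (from the abstract)
reduction = 0.929  # overall mitigation across all LLMs (from the abstract)

remaining = residual_rate(baseline, reduction)
print(f"{remaining:.2%}")  # prints 1.85%
```

This only checks that the reported numbers are mutually consistent under the stated assumption; it says nothing about per-model results.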

Paper Structure (31 sections, 8 figures, 2 tables)

Introduction
Related Works
Approach
Safe Prompt Types
Unsafe Prompts as Contrasts
Model Setup
Prompting Strategies
Evaluation
Results
Baseline Model Behaviors
Interactive Prompting
Contextual Prompting
Few-shot Prompting
Prompting Results
Discussion
...and 16 more sections

Figures (8)

Figure 1: An example of exaggerated safety behavior by the original llama-2-70b-chat-hf [touvron2023llama] in response to a safe prompt from XSTest [xstest].
Figure 2: Iterative refinement prompting for a safe prompt from the XSTest suite category Nonsense group, real discrimination. The black text represents the prompt, the red text represents the first attempt at providing specification for the LLM, and the blue text represents the follow-up response as part of the iterative refinement process.
Figure 3: Iterative refinement prompting for a safe prompt from the XSTest suite category Safe contexts. The black text represents the prompt, the red text represents the first response from the LLM, and the blue text represents the follow-up response as part of the iterative refinement process.
Figure 4: Contextual prompting for a safe prompt from the XSTest suite category Safe targets. The black text represents the prompt, the red text represents the first response from the LLM, and the blue text represents the user's attempt to provide more context to their query as part of the contextual process.
Figure 5: Few-shot prompting for a safe prompt from the XSTest suite category Safe targets. The black text represents the prompt, the red text represents the first response from the LLM, and the blue text represents the user's attempt to provide more context to their query as part of the contextual process.
...and 3 more figures
