Mitigating Exaggerated Safety in Large Language Models
Ruchira Ray, Ruchi Bhalani
TL;DR
The paper tackles exaggerated safety in large language models, where safe prompts are inappropriately refused. By coupling XSTest prompts with interactive, contextual, and few-shot prompting, it maps model decision boundaries and mitigates refusals across Llama2, Gemma, Command R+, and Phi-3. The results show a substantial reduction in misclassifications, from about 26% at baseline to around 1.8% post-prompting, with gains ranging from 90% to 97% depending on the model and strategy. The work demonstrates that carefully designed prompting strategies can preserve helpfulness while maintaining safety, offering practical guidance for deploying safer LLMs and informing future safety tooling. Fewer unwarranted refusals should improve user experience in real-world AI systems without weakening safety compliance.
Abstract
As the popularity of Large Language Models (LLMs) grows, balancing model safety with utility becomes increasingly important. The challenge is to ensure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce exaggerated safety behaviors (26.1% of safe prompts were found to be misclassified as dangerous and refused), we use a combination of XSTest dataset prompts and interactive, contextual, and few-shot prompting to examine the decision boundaries of LLMs such as Llama2, Gemma, Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best for Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate exaggerated safety behaviors by an overall 92.9% across all LLMs. Our work presents multiple prompting strategies to jailbreak LLMs' decision-making processes, allowing them to navigate the fine line between refusing unsafe prompts and remaining helpful.
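The two headline figures in the abstract can be related with a short calculation. The sketch below assumes the 92.9% figure is a relative reduction applied to the 26.1% baseline refusal rate; the abstract does not state this explicitly, and the function name is hypothetical, introduced only for illustration.

```python
def residual_refusal_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Return the refusal rate left after a relative reduction of a baseline rate.

    Both arguments are fractions in [0, 1]. Treating the reported mitigation
    as a relative reduction is an assumption, not something the abstract states.
    """
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be a fraction in [0, 1]")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be a fraction in [0, 1]")
    return baseline_rate * (1.0 - relative_reduction)


# Figures taken from the abstract.
baseline = 0.261   # share of safe prompts refused before mitigation
reduction = 0.929  # overall mitigation reported across all LLMs

residual = residual_refusal_rate(baseline, reduction)
print(f"Residual refusal rate: {residual:.2%}")
```

Under that assumption, roughly 1.85% of safe prompts would still be refused after mitigation, which agrees with the "around 1.8%" figure quoted in the TL;DR.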
