SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Somnath Banerjee; Sayan Layek; Soham Tripathy; Shanu Kumar; Animesh Mukherjee; Rima Hazra

TL;DR

SafeInfer introduces a decoding-time, context-adaptive safety framework for large language models that couples Safety Amplification (SA) with a Safety Guided Decoding Strategy (sGDS). SA uses safe demonstrations to inject a Safety Amplification Vector into a chosen hidden layer, while sGDS non-linearly combines the base model's output distribution with that of a harmful model to produce a safe final output $M_t^{sf}$. Across base and edited models and multiple prompting styles, SafeInfer significantly reduces attack success rates while preserving core utility, as demonstrated on diverse safety benchmarks including HarmEval. The approach relies on activation patching of influential attention heads and a KL-based objective that constrains unsafe tokens, and it is accompanied by HarmEval, a new safety benchmark, and extensive jailbreak testing. Overall, SafeInfer is a practical, model-agnostic addition to existing safety methods that provides context-aware safety without retraining.

Abstract

Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, incorporating new knowledge into language models through editing techniques can further compromise safety. To address these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer comprises two phases: the safety amplification phase, which employs safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the safety-guided decoding phase, which influences token selection based on safety-optimized distributions, ensuring that the generated content complies with ethical guidelines. Further, we present HarmEval, a novel benchmark for extensive safety evaluations, designed to address potential misuse scenarios in accordance with the policies of leading AI tech giants.
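The two phases described above can be illustrated with a small toy sketch. This summary does not give the exact formulas, so everything here is an assumption made for illustration: the additive shift of a hidden state, the log-probability subtraction used to combine the base and harmful distributions, the function names, and the parameters `alpha` and `beta` are all hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def softmax(logits):
    return [math.exp(lp) for lp in log_softmax(logits)]

def amplify_hidden_state(hidden, safety_vector, alpha=1.0):
    """Phase 1 (assumed form): shift one layer's hidden state along a
    safety direction. How the vector is derived is not shown here."""
    return [h + alpha * v for h, v in zip(hidden, safety_vector)]

def safety_guided_distribution(base_logits, harmful_logits, beta=1.0):
    """Phase 2 (assumed form): penalize tokens the harmful model prefers
    by subtracting its log-probabilities, then renormalize."""
    base_lp = log_softmax(base_logits)
    harm_lp = log_softmax(harmful_logits)
    return softmax([b - beta * h for b, h in zip(base_lp, harm_lp)])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy vocabulary of three tokens; token 0 is the one the harmful model favors.
base_logits = [2.0, 1.0, 0.5]
harmful_logits = [3.0, 0.0, 0.0]

shifted = amplify_hidden_state([0.1, -0.2], [0.5, 0.5], alpha=2.0)
base_probs = softmax(base_logits)
safe_probs = safety_guided_distribution(base_logits, harmful_logits)
shift_size = kl_divergence(safe_probs, base_probs)
```

In this toy setup the probability of token 0 drops after the adjustment, and the KL term measures how far the adjusted distribution has moved from the base one. In a real model the same ideas would operate on full hidden-state tensors and vocabulary-sized logits.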

Paper Structure (19 sections, 7 equations, 8 figures, 21 tables)

Introduction
Related work
SafeInfer: Context Adaptive Decoding Time Safety Alignment
Datasets
Experiments
Language models
Prompting technique
Baselines
Jailbreak methods
Evaluation metric
Obtaining the harmful model
Utility and over-safety test
Results
Conclusion
Hyperparameters
...and 4 more sections

Figures (8)

Figure 1: Blackbox illustration of SafeInfer.
Figure 2: Schematic diagram of SafeInfer.
Figure 3: HarmEval: A dataset to test the harmfulness of LLMs. It has $\sim$550 questions across 11 standard policy violating categories.
Figure 4: Topic-wise ethical responses for the HarmEval dataset. The green area highlights the credibility and effectiveness of the SafeInfer strategy.
Figure 5: Speculative sampling for the HarmEval dataset. Calculations are performed for the Llama-2 model.
...and 3 more figures
