Refusal Behavior in Large Language Models: A Nonlinear Perspective

Fabian Hildebrandt; Andreas Maier; Patrick Krauss; Achim Schilling

TL;DR

An investigation of refusal behavior across six LLMs from three architectural families shows that refusal mechanisms exhibit nonlinear, multidimensional characteristics that vary by model architecture and layer. This highlights the need for nonlinear interpretability methods to improve alignment research and inform safer AI deployment strategies.

Abstract

Refusal behavior in large language models (LLMs) enables them to decline to respond to harmful, unethical, or inappropriate prompts, supporting alignment with ethical standards. This paper investigates refusal behavior across six LLMs from three architectural families. We challenge the assumption that refusal is a linear phenomenon by applying dimensionality reduction techniques, including PCA, t-SNE, and UMAP. Our results reveal that refusal mechanisms exhibit nonlinear, multidimensional characteristics that vary by model architecture and layer. These findings highlight the need for nonlinear interpretability methods to improve alignment research and inform safer AI deployment strategies.
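
To make the dimensionality-reduction step concrete, the following is a minimal sketch of projecting activation vectors onto their first two principal components. It is not the authors' pipeline: the activations are synthetic Gaussian stand-ins rather than real residual-stream activations, and the names `pca_project`, `d_model`, `harmless`, and `harmful` are hypothetical choices made for illustration. PCA is computed here through a plain SVD in NumPy.

```python
import numpy as np


def pca_project(activations: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of `activations` onto their top principal components."""
    # Center each feature so the SVD captures variance around the mean.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
d_model = 64  # assumed hidden size, chosen only for this toy example

# Synthetic stand-ins for activations of harmless and harmful prompts.
harmless = rng.normal(0.0, 1.0, size=(100, d_model))
harmful = rng.normal(0.0, 1.0, size=(100, d_model))
harmful[:, 0] += 6.0  # assumption: the classes differ along one direction

activations = np.vstack([harmless, harmful])
labels = np.array([0] * 100 + [1] * 100)

projected = pca_project(activations)  # shape: (200, 2)
pc1_gap = abs(projected[labels == 1, 0].mean() - projected[labels == 0, 0].mean())
```

In this toy setup the classes are linearly separable by construction, so PCA alone recovers the split. The abstract's point is that real refusal representations need not behave this way, which is why nonlinear methods such as t-SNE and UMAP are applied alongside PCA.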

Paper Structure (17 sections, 6 equations, 4 figures, 2 tables)

Introduction
Methodology
Results
Discussion
Conclusion
Author contributions
Code availability
Acknowledgements

Figures (4)

Figure 1: Dimensionality-reduced residual activations using PCA, comparing harmful and harmless instructions across three different models and layers. a) Qwen2-1.5B-Instruct model (first, middle, last layers). b) Bloom-3b model (first, middle, last layers). c) Llama-3.2-3B-Instruct model (first, middle, last layers).
Figure 2: Dimensionality-reduced residual activations of the Qwen2-1.5B-Instruct model at layer 18, visualized using UMAP, showing distinct clusters for harmful and harmless instructions.
Figure 3: Dimensionality-reduced residual activations of the Qwen2-1.5B-Instruct model at layer 11, showing the emergence of sub-clusters for harmful instructions.
Figure 4: GDV, intra-class distance (compactness of harmful and harmless clusters), and inter-class distance (separation between clusters) of the dimensionality-reduced embeddings using PCA. a) Qwen2-1.5B-Instruct demonstrates early layer dominance of the refusal feature. b) Bloom-3b shows peak separability at early to intermediate layers but weaker discrimination in later layers. c) Llama-3.2-3B-Instruct displays progressively stronger separation.
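
The quantities named in the Figure 4 caption can be sketched as follows. The paper's equations are not reproduced in this excerpt, so this is an illustration under stated assumptions rather than the authors' definition. Intra-class and inter-class distance are taken to be mean pairwise Euclidean distances, and the GDV (generalized discrimination value) follows one published two-class formulation as we understand it: z-score each dimension, scale by 0.5, then take mean intra-class distance minus inter-class distance, divided by the square root of the dimensionality. All function names and the toy points are hypothetical.

```python
import math
from itertools import combinations, product
from statistics import mean, pstdev


def zscore_half(points):
    """Z-score each dimension across all points, then scale by 0.5 (assumed convention)."""
    cols = list(zip(*points))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[0.5 * (x - m) / s for x, m, s in zip(p, mus, sds)] for p in points]


def intra_class_distance(cluster):
    """Mean pairwise Euclidean distance within one cluster (compactness)."""
    return mean(math.dist(a, b) for a, b in combinations(cluster, 2))


def inter_class_distance(cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between two clusters (separation)."""
    return mean(math.dist(a, b) for a, b in product(cluster_a, cluster_b))


def gdv_two_class(cluster_a, cluster_b):
    """Two-class GDV under the assumptions stated above; more negative means better separated."""
    n_a = len(cluster_a)
    scaled = zscore_half(list(cluster_a) + list(cluster_b))
    a, b = scaled[:n_a], scaled[n_a:]
    dim = len(scaled[0])
    intra = 0.5 * (intra_class_distance(a) + intra_class_distance(b))
    inter = inter_class_distance(a, b)
    return (intra - inter) / math.sqrt(dim)


# Toy 2-D embeddings standing in for dimensionality-reduced activations.
harmless = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
harmful = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]

compactness = intra_class_distance(harmless)
separation = inter_class_distance(harmless, harmful)
gdv = gdv_two_class(harmless, harmful)
```

Under this convention, compact and well-separated clusters give a clearly negative GDV, while overlapping clusters give a value near zero. Computing the three quantities per layer would yield curves of the kind the caption describes.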
