The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana

TL;DR

The paper addresses the truthfulness–safety trade-off in LLM alignment, showing that increasing factual accuracy can weaken safety guardrails due to overlapping internal representations of refusal and hallucination. It introduces a Sparse Autoencoder (SAE)–based disentanglement and subspace-orthogonalized fine-tuning to separate refusal-related features from hallucination features, preserving the safety subspace during updates. Empirical results on LLaMA-3-8B-Instruct and Qwen-2.5-Instruct across six commonsense tasks and two harmful benchmarks demonstrate that the approach improves task utility while maintaining or enhancing safety, significantly reducing Attack Success Rates without sacrificing performance. The work highlights the importance of preserving refusal signals while enabling truthfulness, offering a practical method to mitigate hallucinations without compromising alignment, and discusses limitations related to interpretability, generalization, and scale.

Abstract

Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
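
The abstract names subspace orthogonalization without giving details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a set of "refusal directions" is already known (in the paper these would come from sparse-autoencoder features; here they are made-up toy vectors) and removes from a candidate update any component that lies in their span. The names `orthonormal_basis`, `project_out`, `refusal_directions`, and `update` are hypothetical.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for span(vectors).

    Vectors that are (numerically) linearly dependent on earlier ones
    are dropped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, basis):
    """Return `update` with its component inside span(basis) removed.

    `basis` must be orthonormal, so the result is orthogonal to every
    basis vector.
    """
    out = list(update)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example (assumed values): two refusal directions spanning the x-y plane.
refusal_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(refusal_directions)

# A hypothetical fine-tuning update that would otherwise move along
# the refusal directions.
update = [0.5, -2.0, 3.0]
projected = project_out(update, basis)

# The projected update no longer has any component along either direction.
overlaps = [dot(projected, d) for d in refusal_directions]
```

In this toy setting the projected update keeps only its third coordinate, so applying it cannot shift the model along the protected directions. How the paper selects those directions and where in training the projection is applied is not stated in this section.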

Paper Structure

This paper contains 34 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The truthfulness–safety trade-off. Interventions that improve truthfulness—such as head steering, probing, or representation mapping—can unintentionally compromise safety by disrupting subspaces associated with refusal behavior. The diagram illustrates how enhancing truthfulness may lead to crossing the refusal boundary, potentially degrading safety unless refusal-related features are explicitly preserved.
  • Figure 2: Aligned vs. truth-seeking responses under intent ambiguity. The prompt concerns a sensitive, potentially harmful topic. The aligned model refuses, even when the phrasing is benign, prioritizing safety. The truth-seeking model answers with factual context (without slurs), improving informativeness but relaxing suppression. This illustrates our hypothesis: interventions that boost truthfulness can be exploited by intent-bearing prompts, weakening refusal safeguards unless refusal features are explicitly preserved.
  • Figure 3: Attack success rates on harmful safety benchmarks (AdvBench and StrongReject) for two methods designed to improve factual truthfulness in LLMs (ITI and TruthX), compared against the base model LLaMA-3-8B-Instruct. Higher values indicate greater vulnerability to adversarial attacks.
  • Figure 4: Contrastive influence map comparing the base model and the LoRA-steered model (truthfulness mode). Heads supporting hallucinations are shown in red, while heads supporting truthfulness are shown in blue. Applying LoRA increases the influence of truthfulness heads and reduces the contribution of hallucination heads.
  • Figure 5: Attack success rate (ASR) for the base model compared to the model with refusal heads patched out. The increase in ASR demonstrates that these heads encode refusal behavior, and their removal weakens the model’s safety mechanisms.
  • ...and 2 more figures