Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano

TL;DR

CASAL tackles LLM hallucinations by embedding known-vs-unknown knowledge boundaries directly into model weights through an offline, three-step pipeline: probe the model’s knowledge boundary, construct contrastive activation steering, and train a lightweight subnetwork to approximate steering. This amortized approach yields substantial hallucination reductions (≈30–40%) with far greater data and compute efficiency than LoRA-based baselines, while preserving known-answer accuracy and demonstrating strong generalization to OOD data and multimodal settings. The method is architecture- and modality-agnostic, extending to MoE models and vision-language tasks, and establishes a link between interpretability-inspired representations and practical deployment. Overall, CASAL offers a scalable, production-friendly strategy to mitigate model uncertainty without sacrificing capability.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer, yet it reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods to practical deployment in production systems.
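
The knowledge-probing step that precedes training can be illustrated with a small sketch. The Figure 1 caption states only that multiple responses per query are sampled to classify queries as known or unknown; everything else here is an assumption made for illustration: the function name `probe_knowledge`, the `sample_answer` callable standing in for a model call, exact-match scoring, the two thresholds, and the choice to drop queries that fall between them.

```python
from typing import Callable, Dict, List, Tuple


def probe_knowledge(
    queries: Dict[str, str],
    sample_answer: Callable[[str], str],
    n_samples: int = 10,
    known_threshold: float = 0.8,    # assumed cutoff, not from the paper
    unknown_threshold: float = 0.2,  # assumed cutoff, not from the paper
) -> Tuple[List[str], List[str]]:
    """Split queries into a known set (D_k) and an unknown set (D_u).

    `queries` maps each question to its reference answer. `sample_answer`
    is a hypothetical stand-in for one stochastic generation from the model.
    """
    known: List[str] = []
    unknown: List[str] = []
    for question, reference in queries.items():
        # Count how many sampled responses match the reference (exact match).
        hits = sum(
            sample_answer(question).strip().lower() == reference.strip().lower()
            for _ in range(n_samples)
        )
        accuracy = hits / n_samples
        if accuracy >= known_threshold:
            known.append(question)
        elif accuracy <= unknown_threshold:
            unknown.append(question)
        # Queries between the two thresholds are skipped as ambiguous.
    return known, unknown
```

In practice `sample_answer` would wrap a sampled generation from the model being probed, and a more forgiving answer matcher than exact string equality would likely be needed.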

Paper Structure

This paper contains 81 sections, 30 equations, 20 figures, 12 tables, 2 algorithms.

Figures (20)

  • Figure 1: Overview of the CASAL algorithm. (A) Knowledge Probing: CASAL starts by probing the model to determine what it knows and what it does not. Multiple responses per query are sampled to classify queries as known ($\mathbf{\mathcal{D}_{k}}$) or unknown ($\mathbf{\mathcal{D}_{u}}$). (B) Steering: Differences in means are computed to construct steering vectors ($\mathbf{v}^{L^*}_{u}$ and $\mathbf{v}^{L^*}_{k}$). Target activations ($\textcolor{UnknownRed}{\mathbf{t}^{L^{*}}_{u}}$ and $\textcolor{KnownGreen}{\mathbf{t}^{L^{*}}_{k}}$) are obtained by adding these steering vectors to the residual stream activations. Pre-CASAL Behavior: Prior to training, the model often hallucinates and produces incorrect answers for unknown queries. (C) CASAL Training: CASAL training is essentially "amortized activation steering": instead of repeatedly steering activations online, we train a small subnetwork (a single-layer NN) to approximate the steering solution offline. (D) Post-CASAL Activations and Behavior: After training, the model learns a sharper representation with a clearer knowledge boundary. It maintains correct answers on known queries while abstaining from answering unknown ones.
  • Figure 2: CASAL is both sample-efficient and compute-efficient. (A--B) CASAL achieves strong hallucination reduction with orders-of-magnitude fewer training examples compared to LoRA-based fine-tuning with SFT, DPO and GRPO. (C) CASAL is over $30\times$ more compute-efficient than PEFT baselines such as LoRA. (D) Hallucination reduction after CASAL training correlates with improved cluster separation between known and unknown queries, measured by silhouette score.
  • Figure 3: CASAL is architecture-agnostic. It effectively reduces hallucination for OLMoE. (A) Visualization of MLP activations from different experts in an MoE model before CASAL training. (B) CASAL applies a local representation loss on residual stream activations. During training, weights are updated on only a lightweight sub-module across experts. (C) Residual stream activations before and after CASAL training. (D) CASAL reduces the hallucination rate on unknown queries while maintaining a low refusal score and high accuracy for known queries.
  • Figure 4: Illustration of steering vector and target activation construction. (A) Mean activations at the target layer $L^*$ are computed for known queries ($\textcolor{KnownGreen}{\bar{a}^{L^*}_k}$) and unknown queries ($\textcolor{UnknownRed}{\bar{a}^{L^*}_u}$). (B) Steering vectors are defined by the difference of these means: $\textcolor{KnownGreen}{v^{L^*}_k} = \textcolor{KnownGreen}{\bar{a}^{L^*}_k} - \textcolor{UnknownRed}{\bar{a}^{L^*}_u}$ (pointing toward the known cluster) and $\textcolor{UnknownRed}{v^{L^*}_u} = \textcolor{UnknownRed}{\bar{a}^{L^*}_u} - \textcolor{KnownGreen}{\bar{a}^{L^*}_k}$ (pointing toward the unknown cluster). (C) Target activations are generated by shifting the raw activations $a^{L^*}(x)$ along the corresponding steering vector: $\textcolor{KnownGreen}{t^{L^*}_k(x)} = a^{L^*}(x) + \textcolor{KnownGreen}{v^{L^*}_k}$ for known queries, and $\textcolor{UnknownRed}{t^{L^*}_u(x)} = a^{L^*}(x) + \textcolor{UnknownRed}{v^{L^*}_u}$ for unknown queries. These target activations serve as supervision signals during CASAL training.
  • Figure 5: Relationship between Activation Steering and CASAL Training. (A) Activation Steering. At the target layer $L^*$, activations $a^{L^*}(x)$ for known and unknown queries are separated by computing mean representations across each group. Their difference defines steering vectors, which are applied to produce target activations $\textcolor{KnownGreen}{t^{L^*}_k(x)}$ (promoting answering for known queries) and $\textcolor{UnknownRed}{t^{L^*}_u(x)}$ (encouraging abstention for unknown queries). (B) CASAL Training. Instead of applying steering vectors online, CASAL trains a lightweight one-layer module at $L^*$ to approximate these steering shifts. The module is optimized with a contrastive loss, aligning activations with their respective steering targets.
  • ...and 15 more figures
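
The construction in Figures 4 and 5 can be sketched numerically. The difference-of-means steering vectors and the shifted target activations follow the Figure 4 caption directly. The rest is assumed for illustration and is not the authors' implementation: the activations are synthetic Gaussian clusters, the hidden size `d` is arbitrary, a residual linear map `a -> a + a @ W + b` stands in for the trained one-layer module, and a plain mean-squared error to the targets replaces the paper's loss, whose exact form is not given in the captions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden size, chosen only for illustration

# Synthetic stand-ins for layer-L* activations of known / unknown queries.
a_known = rng.normal(loc=1.0, size=(32, d))
a_unknown = rng.normal(loc=-1.0, size=(32, d))

# Difference-of-means steering vectors (Figure 4B).
mean_k, mean_u = a_known.mean(axis=0), a_unknown.mean(axis=0)
v_k = mean_k - mean_u  # points toward the known cluster
v_u = mean_u - mean_k  # points toward the unknown cluster

# Target activations (Figure 4C): shift each activation along its vector.
t_known = a_known + v_k
t_unknown = a_unknown + v_u

# Amortization (assumed form): fit a residual linear map so that
# a + a @ W + b approximates the steered target, with no online steering.
X = np.concatenate([a_known, a_unknown])
T = np.concatenate([t_known, t_unknown])
W = np.zeros((d, d))
b = np.zeros(d)


def mse(W: np.ndarray, b: np.ndarray) -> float:
    """Mean-squared error between module outputs and target activations."""
    return float(np.mean((X + X @ W + b - T) ** 2))


initial_loss = mse(W, b)
lr = 0.1
for _ in range(500):
    err = X + X @ W + b - T                     # (N, d) gap to the targets
    W -= lr * 2.0 * X.T @ err / err.size        # gradient of the MSE w.r.t. W
    b -= lr * 2.0 * err.sum(axis=0) / err.size  # gradient of the MSE w.r.t. b
final_loss = mse(W, b)

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because the two synthetic clusters are close to linearly separable, a single linear map can absorb most of the cluster-dependent shift, which is the intuition behind replacing online steering with a small trained module. A real setup would extract activations from the model at layer $L^*$ and update only the chosen submodule's weights.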