SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs

Samir Abdaljalil, Filippo Pallucchini, Andrea Seveso, Hasan Kurban, Fabio Mercorio, Erchin Serpedin

TL;DR

SAFE tackles the pervasive issue of hallucinations in LLMs by coupling entropy-based uncertainty detection with SAE-derived, interpretable feature enrichment. The method flags high-entropy responses and enriches the input with semantically grounded features, steering the model toward relevant knowledge without retraining. Across TruthfulQA, BioASQ, and WikiDoc, SAFE improves accuracy and markedly reduces output entropy, with larger gains observed for certain models like Llama3-8b. The approach is lightweight, training-free, and demonstrates strong potential for robust, explainable LLM inference in critical QA tasks.

Abstract

Despite the state-of-the-art performance of Large Language Models (LLMs), these models often suffer from hallucinations, which can undermine their performance in critical applications. In this work, we propose SAFE, a novel method for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across three diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.

Paper Structure

This paper contains 31 sections, 7 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Illustrative example of SAFE in action. The sample question is taken from the TruthfulQA dataset (Lin et al., 2022), and the response is generated by Gemma-2-9b (Gemma Team, 2024).
  • Figure 2: Overview of the SAFE pipeline. The process involves two primary stages: (1) Uncertainty Assessment, where the response variability of an LLM is measured via entropy calculations across multiple generations. If the entropy surpasses a predefined threshold ($\phi$), the system proceeds to (2) Query Enrichment, where the query and responses are processed through a Sparse Autoencoder (SAE) to extract informative features that enrich the original query.
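
The Uncertainty Assessment stage described in Figure 2 can be illustrated with a minimal sketch. It assumes that responses are grouped by exact match after lowercasing and whitespace normalization, and that entropy is the Shannon entropy (in nats) of the resulting empirical distribution. The paper's exact entropy formulation is not given in this summary and may differ. The function names `response_entropy` and `needs_enrichment` are hypothetical, and the SAE-based enrichment step is intentionally left out.

```python
import math
from collections import Counter


def response_entropy(responses):
    """Shannon entropy (nats) of the empirical distribution over distinct responses.

    Assumption: two responses count as the same answer if they match exactly
    after lowercasing and collapsing whitespace. This is an illustrative
    grouping rule, not necessarily the one used by SAFE.
    """
    if not responses:
        raise ValueError("at least one response is required")
    normalized = [" ".join(r.lower().split()) for r in responses]
    counts = Counter(normalized)
    total = len(normalized)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def needs_enrichment(responses, phi):
    """Return True when entropy exceeds the threshold phi (stage 1 of Figure 2)."""
    return response_entropy(responses) > phi


if __name__ == "__main__":
    # Hypothetical sampled generations for a single query.
    samples = ["Paris", "paris", "Lyon", "Paris"]
    print(response_entropy(samples))
    print(needs_enrichment(samples, phi=0.5))
```

Under these assumptions, a query whose sampled generations all agree has zero entropy and skips enrichment, while disagreement among generations pushes the entropy above $\phi$ and triggers the Query Enrichment stage.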