SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs
Samir Abdaljalil, Filippo Pallucchini, Andrea Seveso, Hasan Kurban, Fabio Mercorio, Erchin Serpedin
TL;DR
SAFE addresses hallucinations in LLMs by coupling entropy-based uncertainty detection with interpretable feature enrichment derived from Sparse Autoencoders (SAEs). The method flags high-entropy responses and enriches the input with semantically grounded features, steering the model toward relevant knowledge without retraining. Across TruthfulQA, BioASQ, and WikiDoc, SAFE improves accuracy and markedly reduces output entropy, with larger gains for some models, such as Llama3-8b. The approach is lightweight and training-free, and it shows potential for robust, explainable LLM inference in critical QA tasks.
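The flag-then-enrich loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that uncertainty is measured as Shannon entropy over several sampled answers grouped by exact match after normalization, that SAE-derived features arrive as plain text labels, and that the threshold value is arbitrary. The names `response_entropy` and `enrich_if_uncertain` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def response_entropy(responses: Sequence[str]) -> float:
    """Shannon entropy (bits) over distinct answers.

    Assumption: answers are grouped by exact match after lowercasing and
    stripping whitespace; the paper may use a different grouping.
    """
    if not responses:
        raise ValueError("need at least one sampled response")
    counts = Counter(r.strip().lower() for r in responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def enrich_if_uncertain(
    query: str,
    responses: Sequence[str],
    features: Sequence[str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> str:
    """Return the query unchanged when the samples agree, else append features.

    `features` stands in for SAE-derived feature descriptions; how they are
    obtained is outside the scope of this sketch.
    """
    if response_entropy(responses) <= threshold:
        return query
    return f"{query}\nRelevant concepts: {', '.join(features)}"
```

When all sampled answers agree, the entropy is zero and the query passes through untouched. When they disagree, the enriched query would be sent back to the model for a second attempt.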
Abstract
Despite the state-of-the-art performance of Large Language Models (LLMs), these models often suffer from hallucinations, which can undermine their performance in critical applications. In this work, we propose SAFE, a novel method for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across three diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.
