Mitigating Large Language Model Hallucination with Faithful Finetuning
Minda Hu, Bowei He, Yufei Wang, Liangyou Li, Chen Ma, Irwin King
TL;DR
Faithful Finetuning (F2) addresses QA hallucinations by explicitly modeling faithfulness through a multi-objective loss that decomposes QA into internal fact retrieval and fact-grounded QA, regularized by targeted hotspot and layer-focused fine-tuning. The method introduces entity- and attention-based heuristics to weight losses on hallucination-prone spans and selects hallucination-prone layers to fine-tune, guided by principles from TruthX. Empirical results on HaluEval, TruthfulQA, and FACTOR show that F2 improves truthfulness over vanilla models and complements representation-editing approaches, illustrating the value of explicit training objectives for reliable, knowledge-grounded LLMs. The work highlights practical gains for deploying safer LLMs in real-world QA tasks and suggests avenues for combining F2 with existing mitigation techniques for further robustness.
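To make the idea of weighting losses on hallucination-prone spans more concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-token log-probabilities of the reference answer are already available, that hotspot spans are given as half-open token index ranges, and that the objectives are combined by a weighted sum. The function names (`hotspot_weights`, `weighted_nll`, `combine_losses`), the weight value 2.0, and the coefficient values are hypothetical choices for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def hotspot_weights(
    num_tokens: int,
    hotspot_spans: Sequence[Tuple[int, int]],
    hotspot_weight: float = 2.0,  # hypothetical value, not taken from the paper
) -> List[float]:
    """Return per-token loss weights: 1.0 by default, hotspot_weight inside spans.

    Spans are half-open [start, end) token index ranges (an assumption).
    """
    weights = [1.0] * num_tokens
    for start, end in hotspot_spans:
        for i in range(max(start, 0), min(end, num_tokens)):
            weights[i] = hotspot_weight
    return weights


def weighted_nll(token_log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-token negative log-likelihoods."""
    if len(token_log_probs) != len(weights):
        raise ValueError("token_log_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_log_probs))
    return -weighted_sum / total_weight


def combine_losses(losses: Dict[str, float], coeffs: Dict[str, float]) -> float:
    """Weighted sum of named objective losses (coefficients are hypothetical)."""
    return sum(coeffs[name] * value for name, value in losses.items())


if __name__ == "__main__":
    # Toy example: 4 answer tokens, tokens 1 and 2 form a hallucination-prone span.
    log_probs = [math.log(0.9), math.log(0.4), math.log(0.5), math.log(0.8)]
    weights = hotspot_weights(len(log_probs), [(1, 3)])
    qa_loss = weighted_nll(log_probs, weights)
    total = combine_losses(
        {"fact_retrieval": 1.2, "fact_grounded_qa": qa_loss},
        {"fact_retrieval": 0.5, "fact_grounded_qa": 1.0},
    )
    print(f"weighted QA loss: {qa_loss:.4f}, combined loss: {total:.4f}")
```

In this sketch, up-weighting a span makes errors on its tokens contribute more to the gradient than errors elsewhere, which is the general effect a hotspot weighting scheme aims for; how F2 actually derives the spans and weights from entity- and attention-based heuristics is not specified here.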
Abstract
Large language models (LLMs) have demonstrated remarkable performance on various natural language processing tasks. However, they are prone to generating fluent yet untruthful responses, known as "hallucinations". Hallucinations can lead to the spread of misinformation and cause harm in critical applications. Mitigating hallucinations is challenging because they arise from factors such as noisy data, model overconfidence, lack of knowledge, and the generation process itself. Recent efforts have attempted to address this issue through representation editing and decoding algorithms, reducing hallucinations without major structural changes or retraining. However, these approaches either implicitly edit LLMs' behavior in latent space or suppress the tendency to output unfaithful results during decoding, rather than explicitly modeling hallucination. In this work, we introduce Faithful Finetuning (F2), a novel method that explicitly models the process of faithful question answering through carefully designed loss functions during fine-tuning. We conduct extensive experiments on popular datasets and demonstrate that F2 achieves significant improvements over vanilla models and baselines.
