Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
Yinghao Hu, Leilei Gan, Wenyi Xiao, Kun Kuang, Fei Wu
TL;DR
The paper tackles the persistent hallucination problem in legal question answering by introducing LegalHalBench, a benchmark for evaluating five common legal hallucinations, and by proposing a two-stage fine-tuning framework that combines supervised fine-tuning (SFT) with Hard Sample-aware Iterative Direct Preference Optimization (HIPO). The models are trained on a large, citation-rich dataset and evaluated with three automatic metrics: Non-Hallucinated Statute Rate (NHSR), Statute Relevance Rate (Rel), and Legal Claim Truthfulness (L_C). The method achieves substantial factuality gains (e.g., NHSR up to 38.353% on GLM4-Chat-9B with SFT+HIPO, Rel up to 7.025, and L_C up to 9.079) and improved helpfulness across multiple baselines. Performance was found to stabilize after three HIPO iterations, and human-consistency analyses corroborate the alignment of the automatic metrics with expert judgments. The work provides a scalable data-generation pipeline and a targeted evaluation toolkit to advance the practical deployment of factual, legally grounded LLMs. Future work will explore gains beyond the observed plateaus and probe the models' knowledge boundaries more deeply.
Abstract
Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stakes domains such as legal question answering (QA). To reduce the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations that occur when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
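For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It illustrates only the generic objective that HIPO builds on. The hard-sample-aware and iterative components are not specified in this summary and are not modeled here. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Assumption: inputs are summed token log-probabilities of each
    response under the trained policy and a frozen reference model.
    This is plain DPO, not the paper's HIPO variant.
    """
    # Implicit rewards: how much the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

In a legal QA setting, the chosen response would typically be the factually grounded answer and the rejected response the hallucinated one. The loss falls below log 2 once the policy prefers the chosen answer more strongly than the reference model does.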
