Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering

Yinghao Hu; Leilei Gan; Wenyi Xiao; Kun Kuang; Fei Wu

TL;DR

The paper tackles the persistent hallucination problem in legal question answering by introducing LegalHalBench, a benchmark for evaluating five common legal hallucinations, and proposing a two-stage fine-tuning framework that combines supervised fine-tuning (SFT) with Hard Sample-aware Iterative Direct Preference Optimization (HIPO). The authors construct a large, citation-rich training dataset and evaluate with three automatic metrics: Non-Hallucinated Statute Rate (NHSR), Statute Relevance Rate (Rel), and Legal Claim Truthfulness (L_C). The method achieves substantial factuality gains (e.g., NHSR up to 38.353% on GLM4-Chat-9B with SFT+HIPO, Rel up to 7.025, and L_C up to 9.079) and improved helpfulness across multiple baselines. Performance was found to stabilize after three HIPO iterations, and human-consistency analyses corroborate that the automatic metrics align with expert judgments. The work provides a scalable data-generation pipeline and a targeted evaluation toolkit to advance the practical deployment of factual, legally grounded LLMs. Future work will explore continued gains beyond the observed plateaus and deeper knowledge boundaries.

Abstract

Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stakes domains such as legal question answering (QA). To reduce the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations that occur when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
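For readers unfamiliar with the preference-optimization step that HIPO builds on, the following is a minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair. It is not the authors' HIPO procedure: the hard-sample selection and the iteration scheme are not specified in this summary and are left out. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under
    either the policy being trained or the frozen reference model.
    All names and the beta default are hypothetical, for illustration.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)),
    # evaluated in a numerically stable way for either sign.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

In a legal QA setting, the chosen answer would typically be the factually grounded one and the rejected answer the hallucinated one. The loss falls below log 2 once the policy prefers the chosen answer more strongly than the reference model does.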

Paper Structure (54 sections, 5 equations, 8 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Large Language Models in Legal Domain
Hallucinations in Large Language Models
Legal Hallucination Definition, Benchmark and Evaluation Metrics
Hallucinations in Legal Question Answering
Incorrect law name.
Incorrect legal code number.
Fabrication of legal provision.
Incorrect citation of legal provision.
Suggestions that contradict regulations.
Legal Hallucination Benchmark
Legal Hallucination Evaluation Metrics
Non-Hallucinated Statute Rate.
Statute Relevance Rate.
...and 39 more sections

Figures (8)

Figure 1: Hallucinations of LLMs in the legal question answering task.
Figure 2: An illustration of the training dataset construction pipeline.
Figure 3: The win rate of the LegalHalBench experiment. The chart presents the win rates of GLM4 Chat 9B-based HIPO against other LLMs, evaluated using the latest GPT-4-turbo.
Figure 4: Template for Extracting Statutes Using LLMs.
Figure 5: Prompt for the Statute Relevance Rate.
...and 3 more figures
