Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Abe Bohan Hou; William Jurayj; Nils Holzenberger; Andrew Blair-Stanek; Benjamin Van Durme

TL;DR

This work introduces the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis.

Abstract

Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs often hallucinate in this setting, in ways that are difficult for non-professionals and existing text evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves a 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% contain hallucinations of different kinds.
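As a rough illustration of how precision, recall, and F1 can be computed for a detector that predicts a set of gap categories per generated analysis, the sketch below scores each example by set overlap and averages over examples. This is an assumption for illustration only: the function names `set_prf` and `mean_prf`, the integer category labels, and the per-example averaging are hypothetical and are not taken from the paper, whose exact metric definitions are not given in this section.

```python
from typing import List, Set, Tuple


def set_prf(gold: Set[int], pred: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one example with set-valued labels.

    Hypothetical helper: labels are integer gap-category ids.
    Empty denominators are scored as 0.0 (an assumption, not a standard).
    """
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_prf(golds: List[Set[int]], preds: List[Set[int]]) -> Tuple[float, float, float]:
    """Average the per-example scores over a dataset (0.0 if it is empty)."""
    scores = [set_prf(g, p) for g, p in zip(golds, preds)]
    if not scores:
        return 0.0, 0.0, 0.0
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


if __name__ == "__main__":
    # Toy data: the second prediction misses one of two gold categories.
    gold_labels = [{2}, {1, 3}]
    predicted = [{2}, {1}]
    print(mean_prf(gold_labels, predicted))
```

In this toy case every predicted category is correct, so precision stays high while recall drops. A gap between precision and F1 of the kind reported above could arise from the same pattern, although that depends on how the paper averages its scores.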

Paper Structure (32 sections, 2 equations, 19 figures, 2 tables)

Introduction
Background
Legal Analysis Generation
Hallucination
Hallucination in Legal Generation
A Taxonomy of Gaps
Intrinsic Gaps
Extrinsic Gaps
Target Mismatch
Citation Content Mismatch
When Are Legal Analyses Unacceptable?
Gap Detection
Problem Formulation
Experimental Setup
Detection Results
...and 17 more sections

Figures (19)

Figure 1: Detection results among the best detectors with different base models. $M\#ds$ means the best detector of base model $M$ has $d$ in-context demonstrations. GPT-4o#20s achieves the highest $mGEM$ and $mGP$, while Mistral-Nemo-Instruct-2407#16s achieves the highest $mGR$ and $mGF_1$.
Figure 2: Our proposed taxonomy of gaps. Each category is discussed in depth in Section \ref{sec:taxonomy}. We highlight target mismatch ($G^2$) and its child nodes ($G^{12}, G^{13}, G^{14}$) because, unlike the other gap categories, they do not indicate hallucinations (with examples in Appendix \ref{app:examples}). Meanwhile, citation content mismatch and intrinsic gaps are generally considered hallucinations, and both indicate invalid generation.
Figure 3: An example generated legal analysis from CLERC (Hou et al., 2024b), labeled with 2 (target mismatch) and given an explanation. See the full version of this example and the prompts to LLM-based detectors in Figures \ref{fig:full_example2} and \ref{fig:prompts}.
Figure 4: Detection results of the GPT-4o detector with different numbers of in-context demonstrations. The full 20-shot detector yields the best overall detection accuracy, while the 16-shot detector shows a marginal drop in accuracy.
Figure 5: Detection results of the Mistral-Nemo detector with different numbers of in-context demonstrations. The model reaches its maximum performance at 16 demonstrations and overfits at 20 demonstrations.
...and 14 more figures
