Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme

TL;DR

This work introduces the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis.

Abstract

Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.

Paper Structure (32 sections, 2 equations, 19 figures, 2 tables)


Figures (19)

  • Figure 1: Detection results for the best detector of each base model. The label $M\#ds$ denotes the best detector built on base model $M$, which uses $d$ in-context demonstrations. GPT-4o#20s achieves the highest $mGEM$ and $mGP$, while Mistral-Nemo-Instruct-2407#16s achieves the highest $mGR$ and $mGF_1$.
  • Figure 2: Our proposed taxonomy of gaps. Each category is discussed in depth in the taxonomy section. We highlight target mismatch ($G^2$) and its child nodes ($G^{12}, G^{13}, G^{14}$) because, unlike the other gap categories, they do not indicate hallucinations (examples are given in the appendix). In contrast, citation content mismatch and intrinsic gaps are generally considered hallucinations, and both indicate an invalid generation.
  • Figure 3: An example of generated legal analysis from CLERC (Hou et al., 2024b), labeled with 2 (target mismatch) and accompanied by an explanation. The full version of this example and the prompts given to the LLM-based detectors appear in separate figures of the paper.
  • Figure 4: Detection results of the GPT-4o detector with different numbers of in-context demonstrations. The full 20-shot detector yields the best overall detection accuracy, while the 16-shot detector shows only a marginal drop in accuracy.
  • Figure 5: Detection results of the Mistral-Nemo detector with different numbers of in-context demonstrations. The model reaches its best performance at 16 demonstrations and overfits at 20 demonstrations.
  • ...and 14 more figures
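
The captions above report detection results as $mGEM$, $mGP$, $mGR$, and $mGF_1$ without defining them on this page. As a rough illustration only, the sketch below assumes these are example-averaged multi-label exact match, precision, recall, and F1 over predicted gap categories. That reading is an assumption, not the paper's stated definition, and the function name `multilabel_scores` and the toy labels are hypothetical.

```python
from typing import Dict, List, Set


def multilabel_scores(gold: List[Set[int]], pred: List[Set[int]]) -> Dict[str, float]:
    """Example-averaged exact match, precision, recall, and F1 for label sets.

    ASSUMPTION: this mirrors a common multi-label evaluation scheme; it is not
    confirmed to match the paper's mGEM / mGP / mGR / mGF_1 definitions.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    em = precision = recall = f1 = 0.0
    for g, h in zip(gold, pred):
        overlap = len(g & h)
        # Per-example precision and recall; empty sets score 0 by convention.
        ex_p = overlap / len(h) if h else 0.0
        ex_r = overlap / len(g) if g else 0.0
        ex_f = 2 * ex_p * ex_r / (ex_p + ex_r) if (ex_p + ex_r) > 0 else 0.0
        em += float(g == h)
        precision += ex_p
        recall += ex_r
        f1 += ex_f

    n = len(gold)
    return {
        "exact_match": em / n,
        "precision": precision / n,
        "recall": recall / n,
        "f1": f1 / n,
    }


# Toy example with hypothetical gap-category labels.
gold_labels = [{2}, {1, 3}]
pred_labels = [{2}, {1}]
scores = multilabel_scores(gold_labels, pred_labels)
print(scores)
```

In this toy case the second prediction is fully precise but recovers only one of two gold categories, so averaged recall falls below averaged precision. The same pattern is consistent with the abstract's best detector having higher precision (80%) than F1 (67%).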