BLT: Can Large Language Models Handle Basic Legal Text?

Andrew Blair-Stanek; Nils Holzenberger; Benjamin Van Durme

TL;DR

The authors create a benchmark of tasks that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a given line of a witness deposition or at a given subsection of a contract.

Abstract

We find that the best publicly available LLMs, such as GPT-4 and Claude, currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts doubt on their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs' reliability as-is for basic legal tasks.
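
To make the kind of task described above concrete, here is a minimal Python sketch of a line-lookup prompt over a line-numbered transcript page, with exact-match scoring. It is an illustration only: the function names, the prompt wording, the page number, and the sample transcript lines are all assumptions made for this example and are not taken from the benchmark itself.

```python
def number_lines(lines, start=1):
    """Prefix each line with its line number, as on a line-numbered transcript page."""
    return [f"{i}: {text}" for i, text in enumerate(lines, start=start)]


def build_cite_to_text_prompt(lines, page, target_line):
    """Build a zero-shot prompt asking for the text at one cited line.

    The wording of the question is hypothetical, not the benchmark's own.
    """
    numbered = "\n".join(number_lines(lines))
    question = f"What is the exact text of line {target_line} of page {page} above?"
    return f"Page {page}\n{numbered}\n\n{question}"


def expected_answer(lines, target_line):
    """Ground truth: the text at the 1-indexed target line."""
    return lines[target_line - 1]


def is_correct(model_answer, lines, target_line):
    """Exact match after trimming surrounding whitespace (an assumed scoring rule)."""
    return model_answer.strip() == expected_answer(lines, target_line).strip()


# Invented sample page, for illustration only.
sample_page = [
    "Q. Where were you that evening?",
    "A. I was at the office.",
    "Q. Until what time?",
]
prompt = build_cite_to_text_prompt(sample_page, page=12, target_line=2)
print(prompt)
```

Because the ground truth is derived directly from the numbered text, prompts of this shape can be generated and scored automatically, without human annotation.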

Paper Structure (28 sections, 4 figures, 5 tables)

This paper contains 28 sections, 4 figures, 5 tables.

Introduction
Background
Legal Use of LLMs
Related LLM Evaluations
The BLT Benchmark
Deposition Transcripts
Synthetic Sections
U.S. Code
General Considerations
Results and Discussion
GPT-4 on transcript text→cite
GPT-4 on transcript cite→text
Poor Performance on synthetic cite→text
Problem Revealed by cite→amended
Fine-Tuning
...and 13 more sections

Figures (4)

Figure 1: GPT-4o incorrectly answering a simple question about a page from a line-numbered witness deposition transcript. Line numbers, exactly as shown above, are passed to GPT-4o, which incorrectly answers 13% of such one-page, 15-line deposition retrieval prompts.
Figure 2: Example of GPT-4 incorrectly answering a defined→cite question with a 2-deep, 2-wide synthetic section. The correct answer is "section 5217(c)".
Figure 3: Graph of location of relevant line versus accuracy on both transcript cite→text and text→cite on 5,000 prompts to GPT-4.
Figure 4: Graph of location of requested cite versus accuracy for 5,000 synthetic cite→text prompts, all using 3-wide, 4-deep synthetic sections, which are 127 lines long. Note that each first subdivision (e.g., (a), (1)) is used for a "General Rule" that has few lines, so such subdivisions are not included in this graph.
