BLT: Can Large Language Models Handle Basic Legal Text?

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme

TL;DR

This paper introduces a benchmark of tasks that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a given line of a witness deposition or at a given subsection of a contract.

Abstract

We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs' reliability as-is for basic legal tasks.

Paper Structure (28 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: GPT-4o incorrectly answering a simple question about a page from a line-numbered witness deposition transcript. Line numbers, exactly as shown above, are passed to GPT-4o, which incorrectly answers 13% of such one-page, 15-line deposition retrieval prompts.
  • Figure 2: Example of GPT-4 incorrectly answering a defined$\rightarrow$cite question with a 2-deep, 2-wide synthetic section. The correct answer is "section 5217(c)".
  • Figure 3: Graph of location of relevant line versus accuracy on both transcript cite$\rightarrow$text and text$\rightarrow$cite on 5,000 prompts to GPT-4.
  • Figure 4: Graph of location of requested cite versus accuracy for 5,000 synthetic cite$\rightarrow$text prompts, all using 3-wide, 4-deep synthetic sections, which are 127 lines long. Note that each first subdivision (e.g., (a), (1)) is used for a "General Rule" that has few lines, so such subdivisions are not included in this graph.
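
The transcript cite$\rightarrow$text task described in Figure 1 can be illustrated with a minimal sketch: number the lines of a one-page, 15-line transcript, ask for the text at one line, and score the reply by exact match. The function names, the `N: text` line format, the question wording, and the placeholder transcript lines below are all assumptions made for illustration; they are not the benchmark's actual prompt format or scoring code.

```python
def make_page(lines, start=1):
    """Prefix each transcript line with its line number (assumed 'N: text' format)."""
    return "\n".join(f"{i}: {text}" for i, text in enumerate(lines, start=start))


def make_cite_to_text_prompt(lines, target):
    """Build a prompt asking for the text at line `target` (hypothetical wording)."""
    page = make_page(lines)
    question = f"What is the exact text of just line {target} above?"
    return f"{page}\n\n{question}"


def is_correct(response, lines, target):
    """Exact-match scoring, ignoring leading and trailing whitespace (an assumption)."""
    return response.strip() == lines[target - 1].strip()


def accuracy(results):
    """Fraction of prompts answered correctly; 0.0 for an empty list."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


# Placeholder 15-line page; a real prompt would use deposition text.
lines = [f"Placeholder transcript text for row {i}." for i in range(1, 16)]
prompt = make_cite_to_text_prompt(lines, target=7)
```

The string in `prompt` would be sent to the model under test, and each reply passed to `is_correct`. Grouping the resulting booleans by `target` before calling `accuracy` gives the kind of location-versus-accuracy breakdown that Figures 3 and 4 describe.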