Analyzing Images of Legal Documents: Toward Multi-Modal LLMs for Access to Justice

Hannes Westermann, Jaromir Savelka

TL;DR

The paper tackles access-to-justice barriers by evaluating a multi-modal LLM, GPT-4o, on its ability to extract structured data from images of legal forms. Using the first page of Ontario's Residential Tenancy Agreement, the authors create 3 scenarios and 5 image formats to test 14 target fields, reporting an overall extraction accuracy of 73% that depends strongly on image quality and format. Field localization is generally robust, but challenges persist for handwriting-like inputs, rare names, and numbers, highlighting both the potential and the gaps for real-world deployment. The work demonstrates a path toward image-based data capture to assist laypeople and self-represented litigants, notes risks related to bias and the digital divide, and outlines plans for scaling and integration with legal guidance systems.

Abstract

Interacting with the legal system and the government requires the assembly and analysis of various pieces of information that can be spread across different (paper) documents, such as forms, certificates and contracts (e.g. leases). This information is required in order to understand one's legal rights, as well as to fill out forms to file claims in court or obtain government benefits. However, finding the right information, locating the correct forms and filling them out can be challenging for laypeople. Large language models (LLMs) have emerged as a powerful technology that has the potential to address this gap, but still rely on the user to provide the correct information, which may be challenging and error-prone if the information is only available in complex paper documents. We present an investigation into utilizing multi-modal LLMs to analyze images of handwritten paper forms, in order to automatically extract relevant information in a structured format. Our initial results are promising, but reveal some limitations (e.g., when the image quality is low). Our work demonstrates the potential of integrating multi-modal LLMs to support laypeople and self-represented litigants in finding and assembling relevant information.
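To make the evaluation setup concrete, the following is a minimal sketch of how field-level extraction accuracy could be tallied per scenario and image format, and then overall. It is not the authors' code. The exact-match comparison after whitespace and case normalization, the function names, the format labels (`format_a`, `format_b`), and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(value):
    """Collapse whitespace and ignore case (an assumed matching rule)."""
    return " ".join(str(value).split()).casefold()


def accuracy_by_cell(records):
    """Return accuracy per (scenario, image_format) cell.

    records: iterable of (scenario, image_format, target, predicted).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for scenario, image_format, target, predicted in records:
        key = (scenario, image_format)
        totals[key] += 1
        if normalize(target) == normalize(predicted):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def overall_accuracy(records):
    """Return the fraction of all fields extracted correctly."""
    records = list(records)
    correct = sum(
        normalize(target) == normalize(predicted)
        for _, _, target, predicted in records
    )
    return correct / len(records)


# Hypothetical toy data, not results from the paper.
toy_records = [
    (1, "format_a", "123", "123"),
    (1, "format_a", "Jane Doe", "jane  doe"),
    (1, "format_b", "456", "458"),
    (1, "format_b", "Main Street", "Main Street"),
]

cells = accuracy_by_cell(toy_records)
overall = overall_accuracy(toy_records)
```

With 3 scenarios, 5 image formats, and 14 target fields, a full run would contain 3 × 5 × 14 = 210 such records, and the per-cell values would populate a scenario-by-format grid like the heatmap listed as Figure 3.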

Paper Structure

This paper contains 13 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Excerpts from filled-out form in the different formats used for experiments, Scenario 2.
  • Figure 2: Diagram showing the experimental design.
  • Figure 3: Accuracy heatmap for different scenarios and formats.
  • Figure 4: Excerpts from data for street numbers. T refers to the target value, while P shows the value extracted by the model.
  • Figure 5: Excerpts from selected samples where the model was able to correctly extract the information despite poor data quality. T refers to the target value, while P refers to the prediction of the model.