Analyzing Images of Legal Documents: Toward Multi-Modal LLMs for Access to Justice
Hannes Westermann, Jaromir Savelka
TL;DR
The paper tackles access-to-justice barriers by evaluating a multi-modal LLM, GPT-4o, on extracting structured data from images of legal forms. Using the first page of Ontario's Residential Tenancy Agreement, the authors create 3 scenarios and 5 image formats to test 14 target fields, reporting an overall extraction accuracy of 73% that depends strongly on image quality and format. Field localization is generally robust, but challenges persist for handwriting-like inputs, rare names, and numbers, highlighting both the potential and the gaps for real-world deployment. The work demonstrates a path toward image-based data capture to assist laypeople and self-represented litigants, while noting risks related to bias and the digital divide and outlining plans for scaling and integration with legal guidance systems.
Abstract
Interacting with the legal system and the government requires the assembly and analysis of various pieces of information that can be spread across different (paper) documents, such as forms, certificates and contracts (e.g., leases). This information is required in order to understand one's legal rights, as well as to fill out forms to file claims in court or obtain government benefits. However, finding the right information, locating the correct forms and filling them out can be challenging for laypeople. Large language models (LLMs) have emerged as a powerful technology that has the potential to address this gap, but they still rely on the user to provide the correct information, which may be challenging and error-prone if the information is only available in complex paper documents. We present an investigation into utilizing multi-modal LLMs to analyze images of handwritten paper forms, in order to automatically extract relevant information in a structured format. Our initial results are promising, but reveal some limitations (e.g., when the image quality is low). Our work demonstrates the potential of integrating multi-modal LLMs to support laypeople and self-represented litigants in finding and assembling relevant information.
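As a rough illustration of what "extracting information in a structured format" and scoring it could look like, the sketch below compares a dictionary of extracted field values against known ground truth and reports the fraction of fields that match. This is a minimal sketch under stated assumptions, not the authors' evaluation procedure: the field names, the example values, and the choice of exact matching after case and whitespace normalization are all hypothetical.

```python
from typing import Dict, Optional


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join((value or "").lower().split())


def field_accuracy(expected: Dict[str, str],
                   extracted: Dict[str, Optional[str]]) -> float:
    """Return the fraction of expected fields whose extracted value matches.

    Assumption: a field counts as correct only on an exact match after
    normalization; missing or None values count as incorrect.
    """
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(
        1
        for name, truth in expected.items()
        if normalize(extracted.get(name)) == normalize(truth)
    )
    return correct / len(expected)


# Hypothetical ground truth and model output for one form image.
expected = {
    "landlord_name": "Jane Doe",
    "tenant_name": "John Smith",
    "unit": "4B",
    "street_number": "123",
}
extracted = {
    "landlord_name": "jane  doe",   # matches after normalization
    "tenant_name": "Jon Smith",     # misspelled name, counted as wrong
    "unit": "4B",
    "street_number": None,          # field not extracted
}

score = field_accuracy(expected, extracted)
print(f"{score:.2f}")  # 0.50
```

Exact matching is deliberately strict here: for names and numbers on legal forms, a near miss such as a single wrong character is still an error that a user would have to catch and correct.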
