Evaluating AI for Law: Bridging the Gap with Open-Source Solutions

Rohan Bhambhoria; Samuel Dahan; Jonathan Li; Xiaodan Zhu

TL;DR

The paper addresses the risks of applying general-purpose AI to high-stakes legal tasks and argues for domain-specific, open-source solutions to improve accuracy, transparency, and access to justice. It introduces LegalQA, a high-quality legal QA dataset curated from lay questions and expert Canadian-law answers, along with Law Stack Exchange content, and proposes OpenJustice as a crowdsourced, open-source framework for building legal AI. Benchmarking with GPT-4 and Mixtral indicates that while GPT-4 attains low factual error, open-source models lag and exhibit issues such as missing citations and verbosity, underscoring the need for domain-focused methods. The authors propose a concrete OpenJustice architecture and a three-path framework (build, fine-tune, or train small models) supported by a data-centric development and evaluation pipeline to democratize robust, explainable legal AI that can enhance access to justice.

Abstract

This study evaluates the performance of general-purpose AI, like ChatGPT, in legal question-answering tasks, highlighting significant risks to legal professionals and clients. It suggests leveraging foundational models enhanced by domain-specific knowledge to overcome these issues. The paper advocates for creating open-source legal AI systems to improve accuracy, transparency, and narrative diversity, addressing general AI's shortcomings in legal contexts.

Paper Structure (11 sections, 3 figures, 3 tables)

This paper contains 11 sections, 3 figures, 3 tables.

Introduction
Background: AI Is Not Yet Ready for Law
Datasets and Statistics
Annotation Guidelines
Experimental Setup
Results and Discussion
A Framework for Legal AI
The State of Legal AI
OpenJustice: A Recipe for Building a Crowdsourced Legal Language Model
Conclusion
Acknowledgements

Figures (3)

Figure 1: Distribution of sequence lengths for LegalQA and Law Stack Exchange. We measure the length in tokens (with byte-pair encoding) and combine the train and test sets.
Figure 2: Automatic evaluation results for (a) LegalQA and (b) Law Stack Exchange. The experimental setting is described in the Experimental Setup section.
Figure 3: Legal Community Feedback utilized for OpenJustice
