Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers

Nick Whitehouse; Nicole Lincoln; Stephanie Yiu; Lizzie Catterson; Rivindu Perera

TL;DR

The paper benchmarks six generalist LLMs against three groups of human invoice reviewers (Experienced Lawyers, Early-Career Lawyers, and Legal Operations Professionals) on 50 legal invoices (492 line items), comparing accuracy, speed, and cost. Measured against a ground truth established by expert-reviewer consensus and standardized billing guidelines, the LLMs reach up to 92% invoice-level accuracy and line-item F-scores of up to 0.806, while reviewing an invoice in as little as 3.6 seconds and reducing per-invoice costs to cents. On these results the LLMs outperform the human reviewers on every metric measured, which points to AI’s potential to transform legal spend management and raises the question of hybrid workflows that balance automation with human oversight. The study also discusses industry implications, adoption challenges, and avenues for future research on AI-human collaboration in real-world invoice review.

Abstract

Legal invoice review is a costly, inconsistent, and time-consuming process, traditionally performed by Legal Operations, Lawyers or Billing Specialists who scrutinise billing compliance line by line. This study presents the first empirical comparison of Large Language Models (LLMs) against human invoice reviewers (Early-Career Lawyers, Experienced Lawyers, and Legal Operations Professionals), assessing their accuracy, speed, and cost-effectiveness. Benchmarking state-of-the-art LLMs against a ground truth set by expert legal professionals, our findings show that LLMs decisively outperform humans across every metric. In invoice approval decisions, LLMs achieve up to 92% accuracy, surpassing the 72% ceiling set by experienced lawyers. At the line-item level, LLMs dominate classification, with top models reaching F-scores of 81%, compared to just 43% for the best-performing human group. Speed comparisons are even more striking: while lawyers take 194 to 316 seconds per invoice, LLMs can complete a review in as little as 3.6 seconds. And cost? AI slashes review expenses by 99.97%, reducing invoice processing costs from an average of $4.27 per invoice for human invoice reviewers to mere cents. These results highlight the evolving role of AI in legal spend management. As law firms and corporate legal departments struggle with inefficiencies, this study signals a seismic shift: the era of LLM-powered legal spend management is not on the horizon; it has arrived. The challenge ahead is not whether AI can perform as well as human reviewers, but how legal teams will strategically incorporate it, balancing automation with human discretion.
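To make the headline numbers easier to check, the following Python sketch shows how a line-item F-score can be computed and how the quoted speed and cost figures relate arithmetically. It is an illustration, not the study's evaluation code: the function names are hypothetical, the F-score is assumed to be the standard binary F1 (the abstract does not say which variant is used), and the counts passed to `f_score` are made-up inputs. Only the values 194, 316, 3.6, 4.27, and 99.97 come from the abstract.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1 from true-positive, false-positive, false-negative counts.

    Assumption: the paper's "F-score" is the usual harmonic mean of
    precision and recall; the abstract does not state the variant.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def speedup(human_seconds: float, llm_seconds: float) -> float:
    """How many times faster the LLM review is than the human review."""
    return human_seconds / llm_seconds


def cost_after_reduction(baseline_cost: float, reduction_pct: float) -> float:
    """Cost remaining after a percentage reduction from a baseline."""
    return baseline_cost * (1 - reduction_pct / 100)


if __name__ == "__main__":
    # Made-up counts, purely to exercise the formula.
    print(f"F1 for tp=8, fp=2, fn=2: {f_score(8, 2, 2):.3f}")

    # Figures quoted in the abstract: 194 to 316 s per invoice for lawyers,
    # 3.6 s for the fastest LLM review.
    print(f"Speedup range: {speedup(194, 3.6):.1f}x to {speedup(316, 3.6):.1f}x")

    # Figures quoted in the abstract: $4.27 average human cost, 99.97% reduction.
    print(f"Implied LLM cost per invoice: ${cost_after_reduction(4.27, 99.97):.6f}")
```

Taken at face value, a 99.97% reduction from $4.27 implies a per-invoice cost well under one cent, and the quoted timings imply a speedup of roughly 54x to 88x over lawyers. The abstract does not say which model the cost percentage refers to, so the implied cost is a reading of the stated figures rather than a reported result.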
