LLMs Provide Unstable Answers to Legal Questions
Andrew Blair-Stanek, Benjamin Van Durme
TL;DR
The paper reports substantial instability in leading LLMs when answering hard legal questions, even with temperature set to 0 to make outputs as deterministic as possible. It introduces a 500-question appellate dataset and a repeatable probing protocol for measuring stability across models, finding instability rates of up to 50.4% and accuracy only slightly above chance. Models often agree more with each other than with the actual court outcomes, which highlights a reliability gap for legal AI deployments. The work cautions practitioners to scrutinize stability alongside accuracy and motivates further research into deterministic prompting and model behavior in legal domains.
Abstract
An LLM is stable if it reaches the same conclusion when asked the identical question multiple times. We find leading LLMs like gpt-4o, claude-3.5, and gemini-1.5 are unstable when providing answers to hard legal questions, even when made as deterministic as possible by setting temperature to 0. We curate and release a novel dataset of 500 legal questions distilled from real cases, each involving two parties, with facts, competing legal arguments, and the question of which party should prevail. When provided the exact same question, LLMs sometimes say one party should win and at other times say the other party should win. This instability has implications for the increasing number of legal AI products, legal processes, and lawyers relying on these LLMs.
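The stability definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' protocol: the `ask` callable, the stub model, the question IDs, and the `n_repeats` value are all hypothetical placeholders, and a real run would replace `ask` with a call to an LLM at temperature 0.

```python
from typing import Callable, Dict, List


def is_stable(answers: List[str]) -> bool:
    """Return True if every repeated answer names the same party."""
    return len(set(answers)) <= 1


def instability_rate(
    ask: Callable[[str], str],
    questions: List[str],
    n_repeats: int = 20,  # assumed value; the paper's repeat count is not stated here
) -> float:
    """Fraction of questions for which repeated identical prompts disagree."""
    if not questions:
        return 0.0
    unstable = 0
    for question in questions:
        # Ask the exact same question n_repeats times and collect the answers.
        answers = [ask(question) for _ in range(n_repeats)]
        if not is_stable(answers):
            unstable += 1
    return unstable / len(questions)


def make_stub_model() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: flips its answer on one question."""
    calls: Dict[str, int] = {}

    def ask(question: str) -> str:
        n = calls.get(question, 0)
        calls[question] = n + 1
        if question == "q_flip":
            # Alternates between parties on successive identical calls.
            return "Party 1" if n % 2 == 0 else "Party 2"
        return "Party 1"

    return ask


if __name__ == "__main__":
    rate = instability_rate(make_stub_model(), ["q_flip", "q_same"], n_repeats=4)
    print(f"instability rate: {rate:.1%}")
```

Under this sketch a question counts as unstable as soon as any two of its repeated answers differ, which is the strictest reading of the definition; a softer variant could instead report the share of answers that match the majority answer.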
