Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities

Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand, Vivek Gupta

TL;DR

CLAUSE introduces a scalable, AI-generated benchmark that stress-tests LLMs on legal reasoning by perturbing real contracts from CUAD and ContractNLI across ten categories of discrepancies. The pipeline combines persona-driven generation, retrieval-augmented grounding, and expert validation to produce high-quality, legally grounded perturbed examples with rich metadata. The authors implement a three-task evaluation framework (binary discrepancy detection, contradiction-type classification, and span-level explanations with legal grounding), paired with a tiered prompting strategy and an experimental evaluation across multiple model families. The findings reveal persistent gaps in current LLMs' ability to detect and justify nuanced contractual inconsistencies: generalization is harder on ContractNLI than on CUAD, and one-shot prompting offers only limited benefits. Overall, CLAUSE provides a diagnostic tool for identifying brittle legal reasoning in AI systems and for guiding the development of safer, more transparent legal AI workflows with human oversight.
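As a rough illustration of how the first two evaluation tasks could be scored, the sketch below computes detection accuracy and contradiction-type accuracy over a list of predictions. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the `Record` fields, the `score` function, and the placeholder category names (`type_a`, `type_b`) are all hypothetical, and span-level explanation scoring is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical record; field names are assumptions, not the CLAUSE schema."""
    has_discrepancy: bool            # Task 1: binary discrepancy detection
    category: Optional[str] = None   # Task 2: contradiction type (placeholder labels)


def score(preds, golds):
    """Score Task 1 on all items and Task 2 on correctly detected positives."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    pairs = list(zip(preds, golds))
    # Task 1: fraction of items where the binary label matches the gold label.
    detect_hits = sum(p.has_discrepancy == g.has_discrepancy for p, g in pairs)
    # Task 2: only items that are truly perturbed AND were flagged by the model.
    detected = [(p, g) for p, g in pairs if p.has_discrepancy and g.has_discrepancy]
    type_hits = sum(p.category == g.category for p, g in detected)
    return {
        "detection_accuracy": detect_hits / len(pairs) if pairs else 0.0,
        "type_accuracy": type_hits / len(detected) if detected else 0.0,
    }


if __name__ == "__main__":
    golds = [Record(True, "type_a"), Record(True, "type_b"), Record(False)]
    preds = [Record(True, "type_a"), Record(True, "type_a"), Record(True, "type_b")]
    print(score(preds, golds))
```

Conditioning the type score on correct detection keeps the two tasks separable: a model that never flags anything gets a low detection score instead of a misleading type score.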

Abstract

The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs' ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.

Paper Structure

This paper contains 53 sections, 1 equation, 4 figures, 11 tables.

Figures (4)

  • Figure 1: CLAUSE pipeline: Data Generation, AI grounding, and expert validation of legal and Evaluation & Analysis.
  • Figure 2: Comparison of model performance on CUAD and NLI datasets across L1 and L2 levels. The top two plots show Miss and Extra metrics for CUAD, while the bottom two correspond to NLI.
  • Figure 3: Example JSON structure for in-text contradiction perturbations, showing the metadata schema used throughout the dataset.
  • Figure 4: Example JSON structure for legal contradiction perturbations, including law citations and statutory references.