Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models
Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, Arnab Bhattacharya
TL;DR
This work reframes Legal Judgment Prediction as a bench-side task in which decisions must be made from the case facts, statutes, precedents, and arguments available at decision time. It combines fact extraction, summarization, hierarchical transformer architectures, and large language models with prompting strategies to predict judgments and generate explanations, evaluated with automatic metrics and human evaluation. The study demonstrates that adding legal information improves prediction accuracy and that hierarchical transformers outperform traditional summarization, while GPT-3.5 Turbo with chain-of-thought prompts offers the strongest performance among LLMs, though it still falls short of expert-level performance. The findings highlight practical considerations for deploying legal AI, including explanation quality (measured by Clarity and Linking) and the need for broader jurisdictional testing and multi-label outcomes in future work.
Abstract
This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike the retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and with summarization of the judgment facts to optimize the input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they have yet to achieve expert-level performance in judgment prediction and explanation tasks.
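
The abstract does not specify how long case documents are segmented for the hierarchical transformers. As a rough illustration only, the sketch below shows one common way to prepare such input: splitting a long token sequence into overlapping fixed-size chunks, each of which could be encoded separately before the chunk representations are aggregated. The function name `split_into_chunks`, the default chunk size of 512, the overlap of 64, and the use of whitespace tokens in place of a model tokenizer are all assumptions made for this example, not details taken from the paper.

```python
from typing import List


def split_into_chunks(
    tokens: List[str],
    chunk_size: int = 512,  # assumed window size, not taken from the paper
    overlap: int = 64,      # assumed overlap between consecutive windows
) -> List[List[str]]:
    """Split a long token sequence into overlapping fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence,
        # so no redundant trailing chunk is produced.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Hypothetical usage: whitespace tokens stand in for real tokenizer output.
case_facts = "The appellant filed a petition challenging the order of the lower court"
example_chunks = split_into_chunks(case_facts.split(), chunk_size=5, overlap=2)
```

In a hierarchical setup of this kind, each chunk would typically be passed through a base encoder such as BERT, and a second-level model would combine the per-chunk representations into a single prediction for the whole case.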
