IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval

Shounak Paul, Dhananjay Ghumare, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi

TL;DR

IL-PCSR introduces a unified Indian legal corpus for parallel statute and precedent retrieval, addressing the long-standing gap of interdependent retrieval tasks. The authors showcase a comprehensive retrieval framework combining lexical and semantic models, domain-specific event/GNN representations, and a two-stage LLM re-ranking strategy that exploits the mutual information between statutes and precedents. Key findings reveal that ensembles excel as first-stage retrievers, while LLM re-ranking delivers state-of-the-art performance, with cross-task conditioning in Stage-2 yielding additional gains. The work also provides an annotation study to ground relevance judgments in practice and demonstrates the practicality of transfer learning over multi-task training. Overall, IL-PCSR offers a valuable resource and methodological blueprint for joint legal retrieval with potential impact on legal analytics and decision-support systems.

Abstract

Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situations). In this paper, we address this gap. We propose IL-PCSR (Indian Legal corpus for Prior Case and Statute Retrieval), a unique corpus that provides a common testbed for developing models for both tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models, and ensembles based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.

Paper Structure

This paper contains 21 sections, 9 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Part of the graph of a case document based on LLM-generated events (input for Event-GNN)
  • Figure 2: Ensemble of lexical and semantic models. $\alpha$ can be tuned either via grid-search or dynamically learned via FFN.
  • Figure 3: Proposed two-stage LLM prompting approach
  • Figure 4: Performance in terms of F1(%) compared to frequency of candidates. On the X-axis, the candidates are sorted from left to right according to frequency and divided into groups (most frequent group-1, most rare group-4).
  • Figure 5: Grid Search F1(%) of the ensemble methods for LSR task. Each figure shows the plot of performance vs. different $\alpha$ values when combining different models with BM25.
  • ...and 3 more figures
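
The captions of Figures 2 and 5 describe an ensemble that combines a lexical model (BM25) with a semantic model through a weight $\alpha$, tuned by grid search. The page does not give the combination formula, so the following is only a minimal sketch under stated assumptions: the ensemble score is taken to be the convex combination $\alpha \cdot s_{\text{lex}} + (1 - \alpha) \cdot s_{\text{sem}}$, scores are min-max normalized per query, and F1 is computed over the top-$k$ candidates. The function names (`min_max`, `ensemble`, `f1_at_k`, `grid_search_alpha`) and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so both models are on a comparable scale (assumed step)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cand: 0.0 for cand in scores}
    return {cand: (s - lo) / (hi - lo) for cand, s in scores.items()}


def ensemble(lexical: Dict[str, float], semantic: Dict[str, float],
             alpha: float) -> Dict[str, float]:
    """Assumed form: alpha * lexical + (1 - alpha) * semantic, per candidate."""
    lex, sem = min_max(lexical), min_max(semantic)
    return {cand: alpha * lex[cand] + (1.0 - alpha) * sem[cand] for cand in lex}


def f1_at_k(scores: Dict[str, float], relevant: Set[str], k: int) -> float:
    """F1 of the top-k candidates against the gold relevant set."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = len(set(top_k) & relevant)
    if hits == 0:
        return 0.0
    precision, recall = hits / k, hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def grid_search_alpha(lexical: Dict[str, float], semantic: Dict[str, float],
                      relevant: Set[str], k: int,
                      steps: int = 11) -> Tuple[float, float]:
    """Try evenly spaced alpha values in [0, 1]; keep the first one with the best F1."""
    best_alpha, best_f1 = 0.0, -1.0
    for i in range(steps):
        alpha = i / (steps - 1)
        f1 = f1_at_k(ensemble(lexical, semantic, alpha), relevant, k)
        if f1 > best_f1:
            best_alpha, best_f1 = alpha, f1
    return best_alpha, best_f1


# Toy single-query example with made-up scores for three candidates.
bm25_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
dense_scores = {"a": 0.0, "b": 0.5, "c": 1.0}
best_alpha, best_f1 = grid_search_alpha(bm25_scores, dense_scores, {"a"}, k=1)
```

In practice $\alpha$ would be selected on a validation set, averaging F1 over many queries rather than a single one. The dynamically learned variant mentioned in the Figure 2 caption (an FFN predicting $\alpha$) is not covered by this sketch.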