A Reproducibility Study of Graph-Based Legal Case Retrieval
Gregor Donabauer, Udo Kruschwitz
TL;DR
This work examines reproducibility in graph-based legal case retrieval by replicating CaseLink on COLIEE 2022 and 2023, extending the evaluation to COLIEE 2024, and exploring heterogeneous graphs and open-LLM preprocessing. The authors release open artifacts and run ablations. They show that the original $NDCG@5$ results are not consistently reproducible, that COLIEE 2024 yields intermediate performance, that heterogeneous graphs do not reliably outperform homogeneous graphs, and that an open LLM (Llama-3.1) improves retrieval performance over GPT-3.5 in several cases. These findings highlight the importance of reproducibility practices in graph-based legal IR. Overall, the study contributes to methodological rigor in legal IR and offers practical guidance for future graph-based retrieval research by sharing implementations and artifacts.
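For readers unfamiliar with the metric, the following is a minimal sketch of how $NDCG@5$ can be computed for a single query case with binary relevance labels. It assumes linear gain and a $\log_2$ rank discount, which is one common variant and not necessarily the exact formulation used in the original evaluation scripts. The function names and case identifiers are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (linear gain, log2 discount)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """NDCG@k for one query, assuming binary relevance (1 if relevant, else 0)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places all relevant cases first, truncated to k positions.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0.0:
        return 0.0  # No relevant cases: define the score as 0 to avoid division by zero.
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical ranking for one query case; "c1" and "c2" are the relevant cases.
score = ndcg_at_k(["c3", "c1", "c9", "c7", "c2"], {"c1", "c2"}, k=5)
```

Scores are typically averaged over all query cases in a test set, so small differences in ranking or preprocessing can shift the reported value.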
Abstract
Legal retrieval is a widely studied area of Information Retrieval (IR). A key task in this domain is retrieving relevant cases for a given query case, often by using language models as encoders to model case similarity. Recently, Tang et al. proposed CaseLink, a novel graph-based method for legal case retrieval that models both cases and legal charges as nodes in a network, with edges representing relationships such as references and shared semantics. This approach offers a new perspective on the task by capturing higher-order relationships between cases that go beyond the level of stand-alone documents. While this shift is a promising direction in the understudied area of graph-based legal IR, multiple recent studies have reported difficulties in reproducing previously published findings. In this work, we therefore reproduce CaseLink to support future research in this area of IR. In particular, we assess its reliability and generalizability by (i) reproducing the original study setup and (ii) applying the approach to an additional dataset. We then build upon the original implementation by (iii) evaluating the approach's performance with a more sophisticated graph data representation and (iv) using an open large language model (LLM) in the pipeline to address known limitations of closed models accessed via an API. Our findings aim to improve the understanding of graph-based approaches in legal IR and to contribute to better reproducibility in the field. To this end, we share all our implementations and experimental artifacts with the community.
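To make the graph view concrete, the sketch below builds a toy network of case and charge nodes and contrasts a heterogeneous representation (edges kept separate per relation type) with a homogeneous one (all edges merged). It is an illustration under stated assumptions, not the CaseLink implementation: the node identifiers, the relation names `references`, `involves`, and `similar_to`, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: identifiers and relation names are illustrative only.
CASES = {"case_a", "case_b", "case_c"}
CHARGES = {"charge_x", "charge_y"}

TYPED_EDGES = [
    ("case_a", "references", "case_b"),
    ("case_b", "involves", "charge_x"),
    ("case_c", "involves", "charge_x"),
    ("charge_x", "similar_to", "charge_y"),
]


def build_heterogeneous(edges):
    """Keep one undirected adjacency map per relation type."""
    graph = defaultdict(lambda: defaultdict(set))
    for src, relation, dst in edges:
        graph[relation][src].add(dst)
        graph[relation][dst].add(src)
    return graph


def build_homogeneous(edges):
    """Merge all relation types into a single undirected adjacency map."""
    adjacency = defaultdict(set)
    for src, _relation, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency


def two_hop_cases(adjacency, start):
    """Cases reachable in exactly two hops that are not direct neighbours of start."""
    direct = adjacency[start]
    reached = set()
    for neighbor in direct:
        reached |= adjacency[neighbor]
    return (reached & CASES) - direct - {start}


heterogeneous = build_heterogeneous(TYPED_EDGES)
homogeneous = build_homogeneous(TYPED_EDGES)
linked_via_charge = two_hop_cases(homogeneous, "case_c")
```

In this toy graph, `case_c` has no direct edge to any other case, yet it reaches `case_b` through the shared node `charge_x`. This is the kind of higher-order relationship that a purely document-level encoder would not see.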
