Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval
Shengjie Ma, Qi Chu, Jiaxin Mao, Xuhui Jiang, Haozhe Duan, Chong Chen
TL;DR
This work tackles the challenge of producing reliable and interpretable relevance judgments for legal case retrieval by using a general-purpose LLM in a carefully designed few-shot workflow that mirrors human expert reasoning. It decomposes relevance judgments into Material Facts and Legal Facts, using Adaptive Demo-Matching, Fact Extraction, and Fact Annotation to produce interpretable, expert-aligned labels. Empirical results on the Chinese LeCaRD dataset show high agreement with human judgments, especially for Legal Facts, and demonstrate that label-only synthetic data can meaningfully boost downstream retrieval models and enable knowledge distillation to smaller LLMs. The approach offers a scalable, data-efficient path to improving legal case retrieval while preserving interpretability, with promising generalization to other legal domains and languages through targeted demonstrations and prompts.
Abstract
Determining which legal cases are relevant to a given query involves navigating lengthy texts and applying nuanced legal reasoning. Traditionally, this task has demanded significant time and domain expertise to identify key Legal Facts and reach sound juridical conclusions. In addition, existing datasets annotated with legal case similarity often lack interpretability, making it difficult to understand the rationale behind relevance judgments. With the growing capabilities of large language models (LLMs), researchers have begun investigating their potential in this domain. Nonetheless, how to employ a general-purpose LLM for reliable relevance judgments in legal case retrieval remains largely unexplored. To address this gap, we propose a novel few-shot approach in which LLMs assist in generating expert-aligned, interpretable relevance judgments. The proposed approach decomposes the judgment process into several stages, mimicking the workflow of human annotators and allowing for the flexible incorporation of expert reasoning to improve the accuracy of relevance judgments. Importantly, it also ensures interpretable data labeling, providing transparency and clarity in the relevance assessment process. Through a comparison of relevance judgments made by LLMs and human experts, we empirically demonstrate that the proposed approach can yield reliable and valid relevance assessments. Furthermore, we demonstrate that, with minimal expert supervision, our approach enables an LLM to acquire case analysis expertise and then transfer this ability to a smaller model via annotation-based knowledge distillation.
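As a rough illustration of the staged workflow, the sketch below chains the three steps named in the TL;DR: demo matching, fact extraction, and fact annotation. It is a minimal sketch under stated assumptions, not the authors' implementation. The names `Demo`, `token_overlap`, `match_demos`, `build_prompt`, and `judge_relevance` are hypothetical, as are the prompt wording and the yes/no label format. `llm` stands for any callable that maps a prompt string to a completion string. The whitespace-token Jaccard similarity is only a placeholder for whatever matching criterion is actually used, and it would need a proper tokenizer for Chinese text such as LeCaRD.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Demo:
    """One expert demonstration (hypothetical structure)."""
    case_text: str
    material_facts: str
    legal_facts: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (placeholder similarity)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_demos(case_text: str, demos: List[Demo], k: int = 1) -> List[Demo]:
    """Stage 1: pick the k demonstrations most similar to the case."""
    ranked = sorted(demos, key=lambda d: token_overlap(case_text, d.case_text),
                    reverse=True)
    return ranked[:k]


def build_prompt(case_text: str, demos: List[Demo], field: str) -> str:
    """Format a few-shot fact-extraction prompt for one fact type."""
    shots = "".join(
        f"Case: {d.case_text}\n{field}: {getattr(d, field)}\n\n" for d in demos
    )
    return f"{shots}Case: {case_text}\n{field}:"


def judge_relevance(query: str, candidate: str, demos: List[Demo],
                    llm: Callable[[str], str], k: int = 1) -> Dict[str, dict]:
    """Run the stages separately for Material Facts and Legal Facts."""
    result = {}
    for field in ("material_facts", "legal_facts"):
        # Stage 2: extract the facts of each case with its matched demos.
        q_facts = llm(build_prompt(query, match_demos(query, demos, k), field))
        c_facts = llm(build_prompt(candidate, match_demos(candidate, demos, k), field))
        # Stage 3: annotate relevance from the extracted facts only.
        answer = llm(
            f"Query {field}: {q_facts}\nCandidate {field}: {c_facts}\n"
            "Relevant? Answer yes or no:"
        )
        # Keep the extracted facts next to the label so it stays interpretable.
        result[field] = {
            "query": q_facts,
            "candidate": c_facts,
            "relevant": answer.strip().lower().startswith("yes"),
        }
    return result
```

Returning the extracted facts together with each label is what keeps the judgment inspectable in this sketch: a reviewer can see which facts the label was based on, separately for Material Facts and Legal Facts.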
