SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval

Haitao Li, Qingyao Ai, Jia Chen, Qian Dong, Yueyue Wu, Yiqun Liu, Chong Chen, Qi Tian

TL;DR

SAILER introduces a structure-aware pre-trained language model for legal case retrieval that explicitly leverages the intrinsic five-part structure of legal documents by using a deep Fact Encoder and two shallow decoders (Reasoning and Decision). The model optimizes a combination of masked language modeling, reasoning reconstruction, and judgment prediction objectives, yielding dense representations $h_F$ that align Fact with subsequent Reasoning and Decision content. Evaluations on four benchmarks in zero-shot and fine-tuning scenarios show significant improvements over traditional methods, generic PLMs, and retrieval-oriented pre-training baselines, with ablations confirming the importance of both decoders and structure-aware objectives. The work demonstrates the practical value of incorporating document structure into pre-training for long, domain-specific texts and points to future integration with legal knowledge graphs for even stronger retrieval performance.
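
The combined objective described above can be written compactly. This is only a sketch: the loss names $\mathcal{L}_{\mathrm{MLM}}$, $\mathcal{L}_{\mathrm{REA}}$, $\mathcal{L}_{\mathrm{DEC}}$ and the weights $\lambda_1, \lambda_2$ are labels assumed here for illustration, and this summary does not say how the three terms are actually weighted.

$$
\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \lambda_1 \, \mathcal{L}_{\mathrm{REA}} + \lambda_2 \, \mathcal{L}_{\mathrm{DEC}}
$$

Here $\mathcal{L}_{\mathrm{MLM}}$ stands for the masked language modeling loss of the Fact Encoder, $\mathcal{L}_{\mathrm{REA}}$ for the reconstruction loss of the Reasoning Decoder, and $\mathcal{L}_{\mathrm{DEC}}$ for the judgment prediction loss of the Decision Decoder. Both decoder terms are conditioned on $h_F$, so minimizing them pushes information about Reasoning and Decision into the Fact representation. Setting $\lambda_1 = \lambda_2 = 1$ gives a plain sum.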

Abstract

Legal case retrieval, which aims to find relevant cases for a query case, plays a core role in intelligent legal systems. Despite the success that pre-training has achieved in ad-hoc retrieval tasks, effective pre-training strategies for legal case retrieval remain to be explored. Compared with general documents, legal case documents are typically long text sequences with intrinsic logical structures. However, most existing language models have difficulty understanding the long-distance dependencies between different structures. Moreover, in contrast to general retrieval, relevance in the legal domain is sensitive to key legal elements. Even subtle differences in key legal elements can significantly affect the judgment of relevance. However, existing pre-trained language models designed for general purposes are not equipped to handle legal elements. To address these issues, in this paper, we propose SAILER, a new Structure-Aware pre-traIned language model for LEgal case Retrieval. It is highlighted in the following three aspects: (1) SAILER fully utilizes the structural information contained in legal case documents and pays more attention to key legal elements, similar to how legal experts browse legal case documents. (2) SAILER employs an asymmetric encoder-decoder architecture to integrate several different pre-training objectives. In this way, rich semantic information across tasks is encoded into dense vectors. (3) SAILER has powerful discriminative ability, even without any legal annotation data. It can distinguish legal cases with different charges accurately. Extensive experiments over publicly available legal benchmarks demonstrate that our approach can significantly outperform previous state-of-the-art methods in legal case retrieval.

Paper Structure (28 sections, 9 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Illustration of the legal case structure. The left case is from the United States (Case Law System) and the right case is from China (Statute Law System). The standard legal case documents can be organized into five parts: Procedure, Fact, Reasoning, Decision, and Tail.
  • Figure 2: The process of writing a legal document. The search for relevant cases takes place after the Fact section is established. The Reasoning and Decision sections contain substantial legal knowledge from the judges.
  • Figure 3: The model design of SAILER, which consists of a deep encoder and two shallow decoders. The Reasoning and Decision sections are aggressively masked and combined with the Fact embedding to reconstruct the key legal elements and the judgment results.
  • Figure 4: Comparison of attention-weight visualizations for SEED and SAILER. Darker colors indicate higher attention weights.
  • Figure 5: The t-SNE plot of legal cases.
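
The decoder-side input described in the Figure 3 caption (an aggressively masked section joined with the Fact embedding) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names `mask_tokens` and `build_decoder_input`, the `[MASK]` placeholder, the string `"h_F"` standing in for the dense Fact vector, and the 0.5 masking ratio are all hypothetical choices made for the example.

```python
import random

MASK = "[MASK]"  # assumed placeholder; a real tokenizer defines its own mask token


def mask_tokens(tokens, ratio, seed=0):
    """Replace a fixed fraction of tokens with MASK.

    Returns (masked_tokens, labels), where labels hold the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


def build_decoder_input(fact_embedding, section_tokens, ratio):
    """Prepend the Fact embedding to a heavily masked Reasoning/Decision section."""
    masked, labels = mask_tokens(section_tokens, ratio)
    # The Fact embedding occupies the first slot and is never a prediction target.
    return [fact_embedding] + masked, [None] + labels


# Illustrative usage with a toy Reasoning section of 10 tokens.
fact = "h_F"  # stands in for the dense Fact vector
reasoning = "the defendant took the property by force which constitutes robbery".split()
dec_in, dec_labels = build_decoder_input(fact, reasoning, ratio=0.5)
```

Because half of the section is hidden and the decoder is shallow, it has little capacity to recover the masked tokens from context alone, so the reconstruction signal has to come largely from the Fact embedding in the first slot. That is the intuition behind pairing a deep encoder with shallow decoders.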