Optimizing Multi-Stage Language Models for Effective Text Retrieval

Quang Hoang Trung, Le Trung Hoang, Nguyen Van Hoang Phuc

TL;DR

The paper tackles domain-specific text retrieval, especially for Japanese legal documents, where sparse retrieval methods fall short. It proposes a two-phase, language-model-based retrieval framework with multi-stage training and hard negative mining, plus an ensemble of three models to boost robustness and accuracy. Empirical results on a Japanese legal dataset and the MS MARCO benchmark show state-of-the-art performance, with the LMS variants and the ensemble delivering clear gains over sparse, dense, and generative baselines. The approach is presented as a scalable solution for complex, multilingual queries in legal information retrieval and beyond.

Abstract

Efficient text retrieval is critical for applications such as legal document analysis, particularly in specialized contexts like Japanese legal systems. Existing retrieval methods often underperform in such domain-specific scenarios, necessitating tailored approaches. In this paper, we introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets. Our method leverages advanced language models to achieve state-of-the-art performance, significantly improving retrieval efficiency and accuracy. To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies, resulting in superior outcomes across diverse tasks. Extensive experiments validate the effectiveness of our approach, demonstrating strong performance on both Japanese legal datasets and widely recognized benchmarks like MS-MARCO. Our work establishes new standards for text retrieval in domain-specific and general contexts, providing a comprehensive solution for addressing complex queries in legal and multilingual environments.

Paper Structure

This paper contains 8 sections, 3 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed two-phase text retrieval framework. In Phase 1, the model is pretrained with the Masked Language Model (MLM) task to build a general contextual understanding of the dataset, which serves as the foundation for subsequent training. Phase 2 consists of three stages. In Stage 1, the encoder from Phase 1 retrieves the most relevant documents; among these, documents confirmed as relevant by human annotations are labeled positive, and documents retrieved but not actually relevant are labeled negative. In Stage 2, the positive and negative documents are fed to encoder models such as BERT or RoBERTa to sharpen the model’s ability to separate relevant from irrelevant documents. Unlike traditional approaches, this method replaces sparse retrieval techniques with language models (LMs) to improve performance. In Stage 3, hard negative examples mined with the fine-tuned Stage 2 model are used for additional training on more challenging cases. The process concludes with an ensemble step that combines multiple models or techniques to leverage their individual strengths, reducing errors and improving the accuracy and stability of retrieval results.
  • Figure 2: Illustration of contrastive learning: similar points (black) are grouped closer together, while dissimilar points (white) are pushed farther apart.
  • Figure 3: Illustration of decision boundaries using contrastive loss.
  • Figure 4: Illustration of sentence similarity calculation: dual encoder models process text chunks and legal passages, applying pooling and cosine similarity for semantic matching.
  • Figure 5: Visualization of the grid search process, depicting My_Recall@3 scores across different weight combinations.
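
The captions for Figures 1, 4, and 5 describe three mechanics that a small example may make more concrete: pooling plus cosine similarity, mining hard negatives from a model's own top results, and a grid search over ensemble weights. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation. All function names, the toy 2-dimensional vectors, and the toy per-model scores are hypothetical; a real pipeline would use encoder outputs instead. Mean pooling is assumed because the captions do not say which pooling is used, and the metric here is plain recall@k, which may differ from the paper's My_Recall@3.

```python
import itertools
import math


def mean_pool(token_vectors):
    """Average token vectors into one embedding (mean pooling is an assumption)."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, passage_vecs):
    """Passage ids sorted by descending cosine similarity to the query."""
    scores = {pid: cosine(query_vec, vec) for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def mine_hard_negatives(query_vec, passage_vecs, positives, k=2):
    """Highest-ranked passages that are NOT annotated as relevant."""
    return [pid for pid in rank(query_vec, passage_vecs) if pid not in positives][:k]


def recall_at_k(ranked, positives, k=3):
    """Fraction of relevant passages found in the top k results."""
    return len(set(ranked[:k]) & positives) / len(positives)


def ensemble_rank(per_model_scores, weights):
    """Rank passages by a weighted sum of each model's score."""
    combined = {
        pid: sum(w * scores[pid] for w, scores in zip(weights, per_model_scores))
        for pid in per_model_scores[0]
    }
    return sorted(combined, key=combined.get, reverse=True)


def grid_search_weights(eval_set, n_models=3, steps=4, k=3):
    """Try every weight combination on a grid summing to 1; keep the best mean recall@k.

    eval_set: list of (per_model_scores, positives) pairs, one per query.
    """
    best_weights, best_recall = None, -1.0
    for combo in itertools.product(range(steps + 1), repeat=n_models):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        weights = tuple(c / steps for c in combo)
        recall = sum(
            recall_at_k(ensemble_rank(scores, weights), pos, k) for scores, pos in eval_set
        ) / len(eval_set)
        if recall > best_recall:
            best_weights, best_recall = weights, recall
    return best_weights, best_recall


# Toy data (hypothetical): one query, four passages, three models.
query_vec = mean_pool([[1.0, 0.0], [1.0, 0.2]])
passages = {"p1": [1.0, 0.0], "p2": [0.9, 0.3], "p3": [0.0, 1.0], "p4": [0.7, 0.7]}
model_a = {pid: cosine(query_vec, vec) for pid, vec in passages.items()}
model_b = {"p1": 0.2, "p2": 0.9, "p3": 0.8, "p4": 0.7}
model_c = {"p1": 0.6, "p2": 0.1, "p3": 0.9, "p4": 0.3}
eval_set = [([model_a, model_b, model_c], {"p1", "p3"})]
```

Calling `mine_hard_negatives(query_vec, passages, {"p1"})` returns the non-relevant passages the toy encoder scores highest, which is the kind of example Stage 3 trains on. `grid_search_weights(eval_set)` returns the best weight tuple and its recall; in practice the search would run over a held-out set of many queries rather than a single one.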