Adaptive Two-Phase Finetuning LLMs for Japanese Legal Text Retrieval

Quang Hoang Trung, Nguyen Van Hoang Phuc, Le Trung Hoang, Quang Huu Hieu, Vo Nguyen Le Duy

TL;DR

This work tackles Japanese legal text retrieval by introducing a native dataset and a two-phase finetuning pipeline inspired by RepLLaMA. Phase 1 (Global Contextualization) trains on BM25+ negatives together with in-batch easy negatives to develop broad generalization, while Phase 2 (Domain-Specific Deepening) focuses on hard negatives drawn from Phase 1 predictions to specialize for complex queries. The approach uses LLaMA-based dense embeddings with LoRA and quantization, and combines lexical and semantic signals through an ensemble framework, achieving state-of-the-art results on a native Japanese legal corpus and strong performance on MS MARCO. The results demonstrate both domain-specific gains and cross-domain generalizability, and the public code and HuggingFace checkpoints enable replication and extension in real-world retrieval systems.

Abstract

Text Retrieval (TR) involves finding and retrieving text-based content relevant to a user's query from a large repository, with applications in real-world scenarios such as legal document retrieval. While most existing studies focus on English, limited work addresses Japanese contexts. In this paper, we introduce a new dataset specifically designed for Japanese legal contexts and propose a novel two-phase pipeline tailored to this domain. In the first phase, the model learns a broad understanding of global contexts, enhancing its generalization and adaptability to diverse queries. In the second phase, the model is fine-tuned to address complex queries specific to legal scenarios. Extensive experiments are conducted to demonstrate the superior performance of our method, which outperforms existing baselines. Furthermore, our pipeline proves effective in English contexts, surpassing comparable baselines on the MS MARCO dataset. We have made our code publicly available on GitHub, and the model checkpoints are accessible via HuggingFace.

Paper Structure

This paper contains 18 sections, 8 equations, 2 figures, 6 tables, 3 algorithms.

Figures (2)

  • Figure 1: Overview of the multi-stage retrieval pipeline, which is fine-tuned in two phases. In the Preprocessing stage, unnecessary elements such as special characters, undesired symbols, and stop words are filtered out, so that the input text is cleaner before it is passed to BM25+, which retrieves the top-$a_1$ documents for the query. In Phase 1: Global Contextualization, the model is trained on three types of documents: positive (human-labeled documents that are truly relevant to the query), negative (top documents retrieved by BM25+ that appear relevant to the query but are not genuinely positive), and easy (top documents relevant to other queries, which may be irrelevant to the current query, incorporated via in-batch negatives). This diverse input enables the model to develop a broad global understanding and improves generalization. In Phase 2: Domain-Specific Deepening, the model fine-tuned in Phase 1 continues to fine-tune on the top-$a_2$ hard documents specific to the query. These include the positive documents and the hard negative documents (documents ranked highly by the Phase 1 fine-tuned model but not labeled as positive). This targeted refinement focuses the model on documents whose relevance to the query is difficult to judge, improving precision in complex retrieval scenarios.
  • Figure 2: Illustration of the encoding mechanism utilizing Large Language Models (LLMs). The query and document are independently transformed into dense representations through separate encoding pathways, with the hidden state at the <EOS> token serving as the final embedding for each. These embeddings are then employed to compute a similarity score, quantifying the semantic alignment between the query and the document.
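
The two captions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes right-padded sequences in which <EOS> is the last real token, it uses cosine similarity as the score (the captions say only "similarity score"), and the function names `last_token_pool`, `similarity`, and `mine_hard_negatives` are hypothetical. The first two functions mirror the encoding in Figure 2. The third mirrors the Phase 2 selection in Figure 1: take the top-$a_2$ documents by score and keep those not labeled positive.

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state at the last non-padded position of each sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    Assumption: right padding, with <EOS> as the final real token.
    """
    hidden_states = np.asarray(hidden_states)
    last_idx = np.asarray(attention_mask).sum(axis=1).astype(int) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

def similarity(query_emb, doc_embs):
    """Cosine similarity between one query embedding and each document embedding.

    Assumption: cosine is used here for illustration; the source does not
    specify the similarity function.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return d @ q

def mine_hard_negatives(scores, doc_ids, positive_ids, top_k):
    """Among the top_k highest-scoring documents, keep those not labeled positive."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    ranked = [doc_ids[i] for i in order[:top_k]]
    return [d for d in ranked if d not in positive_ids]

# Toy example: two sequences, the first has one padded position.
hidden = np.array([
    [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],  # real length 2; third position is padding
    [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]],  # real length 3
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = last_token_pool(hidden, mask)

# Score three toy documents against a query and mine hard negatives from the top 2.
query = np.array([0.0, 1.0])
docs = np.array([[0.0, 2.0], [1.0, 0.0], [1.0, 1.0]])
scores = similarity(query, docs)
hard = mine_hard_negatives(scores, ["d0", "d1", "d2"], {"d0"}, top_k=2)
```

In this toy run, `d0` is the labeled positive and `d2` is the only non-positive document in the top 2, so it is the one hard negative returned. In the setting the captions describe, the scores would come from the Phase 1 fine-tuned model rather than from hand-written vectors.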