Adaptive Two-Phase Finetuning LLMs for Japanese Legal Text Retrieval
Quang Hoang Trung, Nguyen Van Hoang Phuc, Le Trung Hoang, Quang Huu Hieu, Vo Nguyen Le Duy
TL;DR
This work tackles Japanese legal text retrieval by introducing a native dataset and a two-phase finetuning pipeline inspired by RepLLaMA. Phase 1, Global Contextualization, uses BM25+ with in-batch easy negatives to develop broad generalization. Phase 2, Domain-Specific Deepening, trains on hard negatives drawn from Phase 1 predictions to specialize the model for complex queries. The approach builds LLaMA-based dense embeddings with LoRA and quantization and combines lexical and semantic signals in an ensemble framework. It achieves state-of-the-art results on a native Japanese legal corpus and strong performance on MS MARCO, demonstrating both domain-specific gains and cross-domain generalizability. Public code and HuggingFace checkpoints support replication and extension in real-world retrieval systems.
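To make the role of the two kinds of negatives concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style contrastive objective of the kind commonly used for dense retrievers; the summary above does not state the actual loss. The function names, the `temperature` parameter, and the toy vectors are hypothetical. For each query, the positives of the other queries in the batch act as easy in-batch negatives, and passages that an earlier model ranked highly but that are not labelled relevant can be appended as hard negatives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query_vecs, positive_vecs, hard_negative_vecs=None, temperature=1.0):
    """Mean InfoNCE loss over a batch (assumed objective, see lead-in).

    For query i the candidate set is every positive in the batch plus the
    shared hard negatives; positive i is the target, and all other
    candidates are treated as negatives.
    """
    candidates = list(positive_vecs) + list(hard_negative_vecs or [])
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)


def mine_hard_negatives(ranked_ids, positive_ids, k):
    """Return the top-k ranked passage ids that are not labelled relevant."""
    positives = set(positive_ids)
    return [pid for pid in ranked_ids if pid not in positives][:k]


# Toy batch: two queries, each with one positive passage embedding.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]

# Easy setting: only in-batch negatives.
easy_loss = info_nce_loss(queries, positives)

# Harder setting: add one hard negative that looks like the first positive.
hard_loss = info_nce_loss(queries, positives, hard_negative_vecs=[[1.0, 0.0]])

# Hard negatives taken from a hypothetical earlier ranking for one query.
hard_ids = mine_hard_negatives(["p3", "p1", "p7", "p2"], positive_ids=["p1"], k=2)
```

In this toy batch the hard negative is indistinguishable from the first query's positive, so the loss rises once it is added. That reflects the intuition behind a second phase built on hard negatives: they supply a stronger training signal than random in-batch passages.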
Abstract
Text Retrieval (TR) involves finding and retrieving text-based content relevant to a user's query from a large repository, with applications in real-world scenarios such as legal document retrieval. While most existing studies focus on English, limited work addresses Japanese contexts. In this paper, we introduce a new dataset specifically designed for Japanese legal contexts and propose a novel two-phase pipeline tailored to this domain. In the first phase, the model learns a broad understanding of global contexts, enhancing its generalization and adaptability to diverse queries. In the second phase, the model is fine-tuned to address complex queries specific to legal scenarios. Extensive experiments show that our method outperforms existing baselines. Furthermore, our pipeline proves effective in English contexts, surpassing comparable baselines on the MS MARCO dataset. Our code is publicly available on GitHub, and the model checkpoints are available on HuggingFace.
