Improving Vietnamese Legal Document Retrieval using Synthetic Data

Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet

TL;DR

This work tackles Vietnamese legal passage retrieval under data scarcity by generating a large-scale synthetic query dataset from legal passages using the Llama 3 model and applying aspect-guided prompts. The synthetic data is used to pre-train and then fine-tune dense retrievers, specifically bi-encoder and ColBERT, with hard negative mining and the InfoNCE loss, while a Query-as-Context CoT-MAE pre-training step enhances the encoder’s representations. The authors introduce TVPL, a Vietnamese legal benchmark, and demonstrate substantial gains on in-domain benchmarks (TVPL and Legal Zalo 21) with ColBERT achieving top results, as well as competitive out-of-domain performance on a Vietnamese Wiki QA dataset. They also analyze the impact of aspect-guided prompts and show favorable storage-accuracy trade-offs through ColBERT residual compression. The datasets and methods are released publicly to advance Vietnamese language retrieval research and practical legal search applications.

Abstract

In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using a contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvements in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.
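To make the fine-tuning objective concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for one query, one positive passage, and a set of mined hard negatives. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity, and the temperature of 0.05 are all hypothetical choices, and the embeddings are plain Python lists rather than encoder outputs.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for a single query (hypothetical helper, not from the paper).

    query, positive: embedding vectors (lists of floats).
    negatives: list of embedding vectors for mined hard negatives.
    temperature: assumed value; the source does not state one here.
    """
    q = normalize(query)
    # The positive passage is placed at index 0, followed by the negatives.
    candidates = [positive] + list(negatives)
    logits = [dot(q, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive passage.
    return log_denominator - logits[0]


if __name__ == "__main__":
    q = [1.0, 0.0]
    pos = [1.0, 0.0]
    easy = info_nce(q, pos, [[0.0, 1.0]])   # negative unrelated to the query
    hard = info_nce(q, pos, [[0.9, 0.1]])   # negative close to the query
    print(f"easy negative loss: {easy:.6f}")
    print(f"hard negative loss: {hard:.6f}")
```

A negative that lies close to the query contributes far more to the loss than an unrelated one, which is the usual motivation for mining hard negatives rather than sampling passages at random.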

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Workflow for generating synthetic queries and fine-tuning retrieval models using Vietnamese legal texts.
  • Figure 2: The shortened prompt template we used to generate synthetic queries from legal text passages, with placeholders for input documents and few-shot examples omitted.
  • Figure 3: Top 20 domains by number of queries in the synthetic query dataset.