Improving Vietnamese Legal Document Retrieval using Synthetic Data
Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet
TL;DR
This work tackles Vietnamese legal passage retrieval under data scarcity by generating a large-scale synthetic query dataset from legal passages with the Llama 3 model and aspect-guided prompts. The synthetic data is used to pre-train and then fine-tune dense retrievers, specifically a bi-encoder and ColBERT, with hard negative mining and the InfoNCE loss, while a Query-as-Context CoT-MAE pre-training step enhances the encoder’s representations. The authors introduce TVPL, a Vietnamese legal benchmark, and demonstrate substantial gains on in-domain benchmarks (TVPL and Legal Zalo 21), with ColBERT achieving the top results, as well as competitive out-of-domain performance on a Vietnamese Wiki QA dataset. They also analyze the impact of aspect-guided prompts and show favorable storage-accuracy trade-offs through ColBERT residual compression. The datasets and methods are released publicly to advance Vietnamese language retrieval research and practical legal search applications.
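To make the InfoNCE objective with hard negatives mentioned above concrete, here is a minimal sketch for a single query. It is an illustration under stated assumptions, not the authors' training code: the function names, the toy 2-dimensional embeddings, and the temperature of 0.05 are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive passage.

    `temperature` is an assumed value for illustration only.
    """
    # The positive passage always sits at index 0 of the logits.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical): the hard negative is close to the query,
# the easy negative is orthogonal to it.
query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]
hard_negative = [0.9, 0.436]

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
```

In this toy setup the hard negative produces a larger loss than the easy one, which is the usual motivation for mining hard negatives: they supply a stronger training signal than negatives the model already separates well.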
Abstract
In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically a bi-encoder and ColBERT, which are further fine-tuned using a contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvements in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the lack of large labeled datasets in the Vietnamese legal domain.
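The two retriever families named in the abstract differ in how they score a query against a passage. The sketch below contrasts them: a bi-encoder compares one vector per text, while ColBERT uses late interaction, summing for each query token its maximum similarity over the passage tokens. The function names and the toy token embeddings are hypothetical; real models produce these vectors from a trained encoder, which is not shown here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bi_encoder_score(query_vec, passage_vec):
    """Bi-encoder relevance: a single similarity between two pooled vectors."""
    return dot(query_vec, passage_vec)


def colbert_score(query_tokens, passage_tokens):
    """ColBERT-style late interaction (MaxSim).

    For each query token embedding, take the best-matching passage token
    embedding, then sum those maxima over all query tokens.
    """
    return sum(
        max(dot(q, d) for d in passage_tokens)
        for q in query_tokens
    )


# Toy token embeddings (hypothetical values for illustration).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
passage_tokens = [[1.0, 0.0], [0.5, 0.5]]

late_interaction = colbert_score(query_tokens, passage_tokens)
single_vector = bi_encoder_score([1.0, 2.0], [3.0, 4.0])
```

Because ColBERT stores one vector per passage token rather than one per passage, its index is larger than a bi-encoder's, which is why compressing the stored token vectors matters for practical deployment.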
