Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark
Phu-Vinh Nguyen, Minh-Nam Tran, Long Nguyen, Dien Dinh
TL;DR
This paper addresses the lack of Vietnamese information retrieval benchmarks by introducing the Vietnamese Context Search (VCS) benchmark for retrieval and reranking, alongside a modified InfoNCE-based objective for training Vietnamese text embeddings. It constructs ViMedRetrieve, ViRerank, MNLI-R, and QNLI-R to evaluate both retrieval and reranking capabilities, and analyzes a temperature hyper-parameter $\tau$ across a dual-input, instruction-style training framework with in-batch negatives and curated hard negatives. The authors show that hard-negative training improves reranking, and the proposed loss $L_{ours}$ generally outperforms InfoNCE across tasks, with some exceptions; they also reveal that lower $\tau$ values are generally favorable for performance. Overall, the work provides a valuable benchmark and a practical training approach to advance Vietnamese IR research, with public resources to support reproducibility and broader adoption.
Abstract
With the rapid development of natural language processing, many language models have been developed for a wide range of tasks. One important task is information retrieval (IR), which requires models to retrieve documents relevant to a query. Despite its importance in many real-life applications, especially in retrieval-augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This gap makes it difficult to assess and compare existing Vietnamese embedding models on the task and slows the advancement of Vietnamese natural language processing (NLP) research. In this work, we provide the Vietnamese research community with a new benchmark for information retrieval, focusing mainly on retrieval and reranking tasks. We also present a new objective function, based on the InfoNCE loss, which we use to train our Vietnamese embedding model. The proposed function is designed to outperform the original InfoNCE loss on information retrieval tasks. Finally, we analyze the effect of temperature, a hyper-parameter in both objective functions, on the performance of text embedding models.
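
For readers unfamiliar with the baseline objective, the following is a minimal sketch of the standard InfoNCE loss with in-batch negatives and a temperature $\tau$. It is not the modified objective proposed in this work, whose form is not given in the abstract. The function names (`cosine`, `info_nce`), the use of cosine similarity, and the toy two-dimensional vectors are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, docs, tau=0.05):
    """Mean standard InfoNCE loss over a batch.

    Assumption: docs[i] is the positive for queries[i]; every other
    document in the batch serves as an in-batch negative.
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Similarities scaled by the temperature tau.
        logits = [cosine(q, d) / tau for d in docs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive document.
        total += log_z - logits[i]
    return total / len(queries)

# Toy batch (hypothetical embeddings): each query is closest to its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.2, 0.8]]

loss_sharp = info_nce(queries, docs, tau=0.05)  # low temperature
loss_soft = info_nce(queries, docs, tau=1.0)    # high temperature
```

In this toy batch the positives already rank first, so dividing by a smaller $\tau$ widens the gap between positive and negative logits and yields a lower loss. This illustrates only how $\tau$ rescales the similarities; it says nothing about which value works best on real data.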
