Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

Nandan Thakur; Jianmo Ni; Gustavo Hernández Ábrego; John Wieting; Jimmy Lin; Daniel Cer

TL;DR

Multilingual dense retrieval is hampered by scarce and uneven training data across languages. The authors introduce SWIM-IR, a synthetic multilingual retrieval training dataset of 28M pairs across 33 languages, generated with PaLM 2 using SAP (summarize-then-ask prompting), and use it to fine-tune multilingual dense retrievers (SWIM-X) without human supervision. Results on XOR-Retrieve, MIRACL, and XTREME-UP show that SWIM-X can surpass or rival human-supervised baselines in cross-lingual retrieval while remaining competitive in monolingual retrieval, at a fraction of the annotation cost. The work also analyzes the effectiveness of SAP, the amount of synthetic data needed, and transferability across Indo-European languages, and provides a cost-performance comparison, offering a scalable path toward multilingual information access systems.

Abstract

Dense retrieval models have had limited success in multilingual retrieval, due to uneven and scarce training data available across multiple languages. Synthetic training data generation is promising (e.g., InPars or Promptagator), but has been investigated only for English. Therefore, to study model capabilities across both cross-lingual and monolingual retrieval tasks, we develop SWIM-IR, a synthetic retrieval training dataset containing 33 (high to very-low resource) languages for fine-tuning multilingual dense retrievers without requiring any human supervision. To construct SWIM-IR, we propose SAP (summarize-then-ask prompting), where the large language model (LLM) generates a textual summary prior to the query generation step. SAP assists the LLM in generating informative queries in the target language. Using SWIM-IR, we explore synthetic fine-tuning of multilingual dense retrieval models and evaluate them robustly on three retrieval benchmarks: XOR-Retrieve (cross-lingual), MIRACL (monolingual) and XTREME-UP (cross-lingual). Our models, called SWIM-X, are competitive with human-supervised dense retrieval models, e.g., mContriever-X, which indicates that SWIM-IR can cheaply substitute for expensive human-labeled retrieval training data. SWIM-IR dataset and SWIM-X models are available at https://github.com/google-research-datasets/SWIM-IR.
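
As a rough illustration of the summarize-then-ask idea described in the abstract, the sketch below assembles a few-shot prompt in which every exemplar shows a summary before its query, then parses a completion back into a (summary, query) pair. The prompt wording, the `Passage:`/`Summary:`/`Query:` field labels, the exemplar, and both function names are assumptions made for illustration; they are not the prompt or code used to build SWIM-IR.

```python
import re

# Hypothetical few-shot exemplar as (passage, summary, query); not taken from the paper.
EXEMPLARS = [
    (
        "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
        "The Eiffel Tower was completed in 1889.",
        "When was the Eiffel Tower completed?",
    ),
]


def build_sap_prompt(passage, target_language, exemplars=EXEMPLARS):
    """Assemble a summarize-then-ask prompt (assumed layout).

    Each exemplar places the summary before the query, so the model is
    nudged to write a summary first and only then the query.
    """
    parts = [
        "Summarize the passage, then write a question in "
        f"{target_language} that the passage answers."
    ]
    for ex_passage, ex_summary, ex_query in exemplars:
        parts.append(
            f"Passage: {ex_passage}\nSummary: {ex_summary}\nQuery: {ex_query}"
        )
    # The prompt ends at "Summary:" so the completion starts with the summary.
    parts.append(f"Passage: {passage}\nSummary:")
    return "\n\n".join(parts)


def parse_sap_output(completion):
    """Split '<summary>' + newline + 'Query: <query>' into (summary, query).

    Returns None when the completion does not follow the assumed layout.
    """
    match = re.search(
        r"^(.*?)\n\s*Query:\s*(.+?)\s*$", completion.strip(), flags=re.DOTALL
    )
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```

Ending the prompt at the summary field is one simple way to make the summary an intermediate step: the query is only produced after the model has committed to what the passage is about.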

Paper Structure (33 sections, 10 figures, 8 tables)

Introduction
SWIM-IR Dataset Overview
SAP Design Formulation
SWIM-IR Dataset Construction
Human Validation & Content Filtration
Experiments
Datasets and Metrics
Experimental Methods
Training Methodology
Experimental Results
Effectiveness of Summarization in SAP
How much Synthetic Data to Generate?
Indo-European Language Transferability
Ablation Studies
Cost Comparison
...and 18 more sections

Figures (10)

Figure 1: Summary of the quantitative results across three multilingual retrieval benchmarks evaluated in our work. SWIM-X is fine-tuned on SWIM-IR (PaLM 2 generated synthetic training data) without any human supervision. All scores are macro-averaged.
Figure 2: An illustration of SAP (Summarize-then-Ask Prompting) versus standard prompting for English query generation on English Wikipedia. SAP assists the LLM in improving the query generation quality (orange box) by identifying the relevant sections of the input passage (highlighted in red) via the extractive summarization (yellow box) as an intermediate reasoning step.
Figure 3: An illustration of the cross-lingual SWIM-IR dataset construction procedure. Steps are as follows: (1) Sample N passages from the English Wikipedia using stratified sampling for each language out of the L languages; (2) Feed a sampled passage along with the few-shot exemplars to the LLM with SAP; (3 & 4) Parse the LLM output to receive the synthetic query in the target language (above in Bengali); (5) Fine-tune a multilingual dense retriever model (SWIM-X) with training pairs combined for all languages, i.e., N$\times$L pairs.
Figure 4: (Left) SAP (Summarize-then-Ask Prompting) (green) versus standard prompting (red) for various PaLM 2 model sizes. (Right) Varying K-shot prompt exemplars. SWIM-X is fine-tuned on 500K SWIM-IR training pairs and evaluated on XOR-Retrieve.
Figure 5: Heatmap showing MRR@10 denoting language-based transfer ability of SWIM-X (120K) across Indo-European languages available in XTREME-UP (Ruder et al., 2023). (ALL) denotes SWIM-X fine-tuned on all XTREME-UP languages.
...and 5 more figures
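
The construction steps in the Figure 3 caption (sample N passages per language, generate one query per passage, combine into N×L training pairs) can be sketched as follows. This is a minimal sketch under stated assumptions: the stratum key (a `topic` field in the usage below), the round-robin sampling scheme, the record layout, and the `generate_query` callable that stands in for the LLM call are all hypothetical, since the caption does not specify them.

```python
import random
from collections import defaultdict


def stratified_sample(passages, n, key, seed=0):
    """Draw up to n passages, spreading draws evenly across strata.

    `key` maps a passage to its stratum; which attribute defines a
    stratum is an assumption of this sketch.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for passage in passages:
        strata[key(passage)].append(passage)
    for members in strata.values():
        rng.shuffle(members)
    sample = []
    # Round-robin over strata until n passages are drawn or all are exhausted.
    while len(sample) < n and any(strata.values()):
        for members in strata.values():
            if members and len(sample) < n:
                sample.append(members.pop())
    return sample


def build_pairs(passages, languages, n, generate_query, key):
    """Build one (query, passage) pair per sampled passage and language.

    `generate_query(passage, language)` is a stand-in for the LLM call.
    With enough passages, the result holds n * len(languages) pairs.
    """
    pairs = []
    for i, language in enumerate(languages):
        # A different seed per language gives each language its own sample.
        for passage in stratified_sample(passages, n, key, seed=i):
            pairs.append(
                {
                    "language": language,
                    "query": generate_query(passage, language),
                    "passage": passage["text"],
                }
            )
    return pairs
```

For example, with 12 toy passages in 3 strata, 2 target languages, and N = 4, `build_pairs(passages, ["bn", "ja"], 4, stub, key=lambda p: p["topic"])` yields 8 pairs, which would then be pooled across languages for fine-tuning.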
