Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval

Guangyuan Ma, Xing Wu, Peng Wang, Zijia Lin, Songlin Hu

TL;DR

This work tackles the challenge of initializing dense passage retrievers in data-scarce or domain-shift scenarios by pre-training on queries that Large Language Models (LLMs) generate as document expansions. It introduces two pre-training paradigms, contrastive pre-training and bottlenecked query generation, coupled with a two-stage curriculum that reduces reliance on LLM inferences while preserving performance. Across MS-MARCO, TREC-DL, and BEIR, the approach delivers strong zero-shot and out-of-domain improvements, with clear advantages in initialization and domain adaptation. The findings offer a practical path to fast, unsupervised retrieval systems that use LLM capabilities at pre-training time rather than during deployment.

Abstract

In this paper, we systematically study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e., query generation, and effectively transfer the expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inferences. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts retrieval performance on large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data.

Paper Structure (23 sections, 10 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Query Generation prompts for Alpaca-LLaMA and tk-Instruct.
  • Figure 2: Pre-training with LLM-based document expansion for dense passage retrieval. a) We utilize large language models (LLMs) to generate pseudo-queries with zero-shot or few-shot prompts. b) Bottlenecked query generation pre-training appends an auxiliary Transformer decoder to the encoder. Besides the Masked Language Modelling (MLM) loss of the encoder, we connect the encoder and decoder through only the bottlenecked representation, i.e., the hidden states of the [CLS] token, and make the decoder generate the whole LLM-expanded queries with the Cross-Entropy (CE) loss. c) Contrastive pre-training pulls together the representations of the passage and its LLM-expanded queries and pushes away in-batch negatives. To minimize reliance on LLM expansions, we implement a two-stage curriculum learning strategy: the first stage uses randomly sampled passages to fully initialize the encoders, and the second stage then needs only a relatively small number of LLM-expanded queries.
  • Figure 3: Effects of curriculum learning for fine-tuned bottlenecked pre-training with expanded queries generated by Alpaca 13B. The dashed lines are the corresponding baselines from the main results table.
  • Figure 4: Effects of curriculum learning for zero-shot contrastive pre-training with LLM-expanded queries.
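
The contrastive objective described in the Figure 2 caption (pull a passage and its LLM-expanded query together, push away in-batch negatives) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `in_batch_contrastive_loss`, the dot-product similarity, the `temperature` parameter, and the query-to-passage direction of the softmax are all assumptions made for illustration, and plain Python lists stand in for encoder outputs.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(passage_vecs, query_vecs, temperature=1.0):
    """Mean InfoNCE-style loss over a batch (illustrative assumption).

    query_vecs[i] is the embedding of an LLM-expanded query for
    passage_vecs[i]; every other passage in the batch acts as a negative.
    """
    assert len(passage_vecs) == len(query_vecs)
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the matching (positive) passage.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Toy batch: two passages with orthogonal embeddings.
passages = [[1.0, 0.0], [0.0, 1.0]]
aligned_queries = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its passage
swapped_queries = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong passage

aligned_loss = in_batch_contrastive_loss(passages, aligned_queries)
swapped_loss = in_batch_contrastive_loss(passages, swapped_queries)
print(aligned_loss < swapped_loss)
```

In the toy batch, queries aligned with their own passage yield a lower loss than queries aligned with the other passage, which is the behaviour the caption describes. A real setup would compute the embeddings with the encoder and backpropagate through this loss.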