Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data

Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Jimmy Lin

TL;DR

The paper addresses the finding that conventional InfoNCE-based fine-tuning can degrade corpus-specific retrieval effectiveness. It revisits cross-encoder listwise distillation, combined with diverse synthetic query generation, to provide richer relevance signals than binary relevance labels allow. In experiments on BEIR and MSMARCO across multiple BERT-based retrievers, it shows that combining listwise distillation with contrastive learning yields more consistent gains, and that training on diverse synthetic query types improves generalization and can rival training on human-written queries. A general-purpose model, Distill-(RT5, Gemma), performs strongly in both in-domain and out-of-domain settings, which supports the practicality and scalability of synthetic-data pipelines for dense retrieval. Overall, the work presents a pragmatic framework for corpus-specific adaptation and broader retrieval improvements, with potential for extension to larger models and multilingual settings.

Abstract

We investigate improving the retrieval effectiveness of embedding models through the lens of corpus-specific fine-tuning. Prior work has shown that fine-tuning with queries generated using a dataset's retrieval corpus can boost retrieval effectiveness for the dataset. However, we find that, surprisingly, fine-tuning using the conventional InfoNCE contrastive loss often reduces effectiveness in state-of-the-art models. To overcome this, we revisit cross-encoder listwise distillation and demonstrate that, unlike contrastive learning alone, listwise distillation improves retrieval effectiveness more consistently across multiple datasets. Additionally, we show that synthesizing more training data using diverse query types (such as claims, keywords, and questions) yields greater effectiveness than using any single query type alone, regardless of the query type used in evaluation. Our findings further indicate that synthetic queries offer comparable utility to human-written queries for training. We use our approach to train an embedding model that achieves state-of-the-art effectiveness among BERT embedding models. We release our model and both query generation and training code to facilitate further research.
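
To make the contrast between the two training signals concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' released code: the function names, the temperatures, the example scores, and the equal-weight sum of the two losses are all hypothetical, and the KL-divergence form of listwise distillation is a common formulation that may differ in detail from the paper's. InfoNCE supervises only through the single labeled positive, while listwise distillation asks the retriever to match the cross-encoder's full score distribution over the candidate list.

```python
import math

def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def info_nce_loss(student_scores, positive_index=0, temperature=0.05):
    """Contrastive loss: negative log-probability of the labeled positive.

    Every other candidate is treated as equally irrelevant (binary labels).
    """
    return -log_softmax(student_scores, temperature)[positive_index]

def listwise_distillation_loss(student_scores, teacher_scores,
                               student_temperature=0.05,
                               teacher_temperature=1.0):
    """KL(teacher || student) over the candidate list.

    The teacher (cross-encoder) distribution carries graded relevance,
    so a near-duplicate of the positive is not pushed away as a negative.
    """
    log_p = log_softmax(teacher_scores, teacher_temperature)
    log_q = log_softmax(student_scores, student_temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical scores for one query and three candidate passages.
student = [0.80, 0.78, 0.30]  # embedding-model cosine similarities
teacher = [6.0, 5.5, -3.0]    # cross-encoder relevance logits

contrastive = info_nce_loss(student)               # positive is candidate 0
distill = listwise_distillation_loss(student, teacher)
combined = contrastive + distill                   # equal weighting is an assumption
```

In this example the second candidate is nearly as relevant as the first according to the teacher. InfoNCE still penalizes the retriever for scoring it highly, whereas the distillation term is small because the retriever's ranking already resembles the teacher's.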

Paper Structure

This paper contains 26 sections, 2 equations, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Retrieval effectiveness scores on DL19 and DL20 at different hard negative filtering thresholds during E5-unsupervised fine-tuning on MSMARCO passages and synthetic queries.