Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation
Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Jimmy Lin
TL;DR
This work tackles domain specialization for dense retrieval. It shows that standard fine-tuning with an InfoNCE loss can degrade effectiveness, and proposes a combined approach: listwise distillation from a cross-encoder teacher, together with diverse synthetic queries generated by LLMs. The method distills the cross-encoder's rich relevance signals into the retriever by minimizing the KL divergence between the retriever's and the cross-encoder's score distributions, while synthetic queries broaden the training signal beyond human-written data. Results on BEIR and MS MARCO tasks show consistent gains, with synthetic queries rivaling human-written queries in training utility, though the effectiveness of the cross-encoder teacher remains a bottleneck. The findings underscore the value of diverse training queries and teacher guidance for robust, domain-focused dense retrieval, and the authors release their code to enable further exploration.
Abstract
While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise distillation and training with a diverse set of queries ranging from natural user searches and factual claims to keyword-based queries, we achieve consistent effectiveness gains across multiple datasets. Our results also reveal that synthetic queries can rival human-written queries in training utility. However, we also identify limitations, most notably that the effectiveness of the cross-encoder teacher acts as a bottleneck. We release our code and scripts to encourage further research.
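
The two training objectives contrasted above can be illustrated with a minimal sketch for a single query and its candidate passages. This is not the released code: the function names, the temperature parameters, and the toy scores are illustrative assumptions, and the KL direction used here (teacher distribution relative to retriever distribution) is one common choice rather than a detail stated in the abstract.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn a list of raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce_loss(retriever_scores, positive_index=0, temperature=1.0):
    """InfoNCE-style loss: negative log-probability of the one labeled positive.

    Every other candidate is treated as an equally wrong negative.
    """
    probs = softmax(retriever_scores, temperature)
    return -math.log(probs[positive_index])


def listwise_kl_loss(teacher_scores, retriever_scores,
                     teacher_temp=1.0, retriever_temp=1.0):
    """KL(teacher || retriever) over one query's candidate list.

    The teacher (cross-encoder) scores supply a graded target distribution,
    so a passage the teacher also finds relevant is not pushed away as a
    hard negative. (Assumed direction and temperatures; see lead-in.)
    """
    p = softmax(teacher_scores, teacher_temp)
    q = softmax(retriever_scores, retriever_temp)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy example: the teacher considers the first two passages both relevant.
teacher = [4.0, 3.5, -2.0]    # hypothetical cross-encoder scores
retriever = [0.9, 0.2, 0.1]   # hypothetical retriever similarity scores

contrastive = info_nce_loss(retriever, positive_index=0)
distilled = listwise_kl_loss(teacher, retriever)
```

In this sketch the InfoNCE-style loss only rewards raising the score of the single labeled positive, whereas the listwise loss is zero exactly when the retriever reproduces the teacher's full distribution over the candidates.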
