DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation

Ramraj Chandradevan; Kaustubh D. Dhole; Eugene Agichtein

TL;DR

DUQGen addresses unsupervised domain adaptation of neural rankers under target-domain shift by fine-tuning on synthetic data that is both representative and diverse. Its pipeline clusters the target document collection into $K$ groups, samples $N$ documents with probabilistic, diversity-aware selection, and generates in-domain queries by prompting an LLM, followed by hard negative mining and fine-tuning. On the BEIR benchmark, DUQGen consistently improves over zero-shot baselines and outperforms prior state-of-the-art unsupervised methods on 16 of 18 datasets, for an average relative improvement of about $4\%$, using thousands rather than millions of synthetic examples. The work demonstrates data-efficient adaptation, uses ablations to examine the clustering and query-generation choices, and releases code and models to facilitate practical adoption.

Abstract

State-of-the-art neural rankers pre-trained on large task-specific training data, such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, a setting also called zero-shot. However, zero-shot neural ranking may be sub-optimal, as it does not take advantage of target-domain information. Unfortunately, acquiring sufficiently large and high-quality target training data to improve a modern neural ranker can be costly and time-consuming. To address this problem, we propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature: how to automatically generate synthetic training data that is both effective and diverse in order to fine-tune a modern neural ranker for a new domain. Specifically, DUQGen produces a more effective representation of the target domain by identifying clusters of similar documents, and generates a more diverse training dataset by probabilistic sampling over the resulting document clusters. Our extensive experiments over the standard BEIR collection demonstrate that DUQGen consistently outperforms all zero-shot baselines and substantially outperforms the SOTA baselines on 16 out of 18 datasets, for an average of 4% relative improvement across all datasets. We complement our results with a thorough analysis to better understand the proposed method's performance and to identify promising areas for further improvement.
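The document-selection step described above (cluster the collection, then sample probabilistically within clusters) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes documents are already embedded as numeric vectors, uses plain k-means, gives each cluster a quota proportional to its size, and weights documents by a softmax over their negative distance to the cluster centroid. The function names (`sq_dist`, `kmeans`, `sample_diverse`), the `temperature` parameter, and the weighting and quota rules are all hypothetical choices made for illustration. LLM query generation, hard negative mining, and fine-tuning are omitted.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def sample_diverse(vectors, k, n, temperature=1.0, seed=0):
    """Pick up to n document indices spread across k clusters.

    Assumed rules (not taken from the paper): cluster quota is proportional
    to cluster size, and within a cluster a document's sampling weight is a
    softmax over negative squared distance to the centroid.
    """
    rng = random.Random(seed)
    centroids, labels = kmeans(vectors, k, seed=seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    chosen = []
    for lab, members in sorted(clusters.items()):
        quota = min(len(members), max(1, round(n * len(members) / len(vectors))))
        dists = [sq_dist(vectors[i], centroids[lab]) for i in members]
        d_min = min(dists)  # shift so the largest weight is 1 (avoids underflow)
        weights = [math.exp(-(d - d_min) / temperature) for d in dists]
        pool = list(members)
        for _ in range(quota):
            # Weighted draw without replacement.
            pick = rng.choices(range(len(pool)), weights=weights)[0]
            chosen.append(pool.pop(pick))
            weights.pop(pick)
    return chosen[:n]  # rounding can overshoot n, so truncate


if __name__ == "__main__":
    # Toy "embeddings": two well-separated groups of documents.
    docs = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
            [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5]]
    print(sample_diverse(docs, k=2, n=4))
```

In this sketch a lower `temperature` concentrates sampling on documents nearest each centroid, while a higher value spreads it more evenly within the cluster. The selected indices would then be passed to the query-generation step.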

Paper Structure (35 sections, 6 equations, 4 figures, 6 tables)

Introduction
Related Work
Neural Rankers
Unsupervised Domain Adaptation for Neural Rankers
Synthetic IR Data Generation
Methodology
Domain Document Selection
Collection Document Clustering
Probabilistic Document Sampling
Diversified Document Selection
Synthetic Query Generation
Negative Pairs Mining
Fine-tuning with our Synthetic Data
Experiments
Datasets and Metrics
...and 20 more sections

Figures (4)

Figure 1: DUQGen: an unsupervised domain-adaptation framework for neural ranking.
Figure 2: Prompt template with in-context examples for synthetic query generation for the SCIDOCS dataset.
Figure 3: Example queries generated by DUQGen on (a) Quora and (b) TREC-Covid datasets. Pr denotes $Pr(D_i \mid cluster_k)$, where $D_i$ and $cluster_k$ refer to the $i^{th}$ document and the $k^{th}$ cluster.
Figure 4: Example prompts used for (a) NQ and (b) FiQA datasets.
