Aligning Web Query Generation with Ranking Objectives via Direct Preference Optimization
João Coelho, Bruno Martins, João Magalhães, Chenyan Xiong
TL;DR
This study tackles the data bottleneck in training neural web retrievers by using Direct Preference Optimization (DPO) to align synthetic query generation with ranking signals, rather than relying on post hoc filtering. The authors generate multiple queries per document, construct ranking-based preference datasets, and fine-tune the generator with the DPO loss to produce high-quality queries aimed at maximizing downstream retrieval effectiveness. They validate the approach on MS MARCO and ClueWeb22, showing that DPO-aligned queries lead to higher ranker relevance and stronger retrieval performance, along with substantially better query retention. The framework reduces dependence on human annotations and demonstrates data-efficient, scalable improvement for dense retrieval in the Web domain, with potential applicability to other synthetic-data tasks.
Abstract
Neural retrieval models excel in Web search, but their training requires substantial amounts of labeled query-document pairs, which are costly to obtain. With the widespread availability of Web document collections like ClueWeb22, synthetic queries generated by large language models offer a scalable alternative. Still, synthetic training queries often vary in quality, which leads to suboptimal downstream retrieval performance. Existing methods typically filter out noisy query-document pairs based on signals from an external re-ranker. In contrast, we propose a framework that leverages Direct Preference Optimization (DPO) to integrate ranking signals into the query generation process, aiming to directly optimize the model towards generating high-quality queries that maximize downstream retrieval effectiveness. Experiments show higher ranker-assessed relevance between query-document pairs after DPO, leading to stronger downstream performance on the MS MARCO benchmark when compared to baseline models trained with synthetic data.
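The pipeline described above (sample several queries per document, rank them with a re-ranker, and train on preferences) can be illustrated with a minimal sketch. It is not the authors' implementation. The pairing rule (best-scored versus worst-scored query, subject to a hypothetical `min_margin`), the toy `overlap_score` function standing in for a real re-ranker, and the value `beta=0.1` are all assumptions made for illustration. Only the per-pair loss follows the standard DPO objective.

```python
import math
from typing import Callable, Dict, List


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a re-ranker (assumption): counts shared words."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def build_preference_pairs(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float],
    min_margin: float = 0.0,
) -> List[Dict[str, str]]:
    """Pair the best- and worst-scored query for one document.

    The best-vs-worst rule and min_margin are illustrative assumptions,
    not the pairing strategy of the paper.
    """
    if len(queries) < 2:
        return []
    ranked = sorted(queries, key=lambda q: score_fn(q, document), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    margin = score_fn(chosen, document) - score_fn(rejected, document)
    if margin <= 0 or margin < min_margin:
        return []  # no usable preference signal for this document
    return [{"prompt": document, "chosen": chosen, "rejected": rejected}]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio gap)).

    Inputs are sequence log-probabilities of the chosen and rejected query
    under the trained generator (policy) and the frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In practice the log-probabilities would come from the query generator and a frozen copy of it, and the scores from a neural re-ranker. The loss decreases as the generator assigns relatively more probability to the query the ranker prefers.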
