The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training

Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto

TL;DR

This work addresses contradictory findings on query diversity in synthetic training data for dense retrieval by introducing quality-diversity (Q-D) metrics and the Complexity-Diversity Principle (CDP), which links query complexity to the value of diversity. It proposes zero-shot multi-query synthesis, which generates $M$ diverse queries per document using prompts with controlled diversity, guided by Q-D metrics, and demonstrates that diversity yields the largest gains on reasoning-intensive, multi-hop tasks. Across 31 datasets and four benchmark families, the approach achieves state-of-the-art performance on multi-hop retrieval and reveals a trade-off: high-complexity tasks benefit from diversity (CW $> 10$), whereas simple tasks may degrade with excessive diversity (CW $< 7$). The study provides actionable guidelines for when to apply diversity, highlights diversity as a regularizer, and discusses cost-efficient configurations, with implications for practical dense retriever training and broader generalization.

Abstract

Prior work reports conflicting results on query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics to quantify diversity's impact, making the problem measurable. Through experiments on 4 benchmark types (31 datasets), we find query diversity especially benefits multi-hop retrieval. Deep analysis on multi-hop data reveals that diversity benefit correlates strongly with query complexity ($r \geq 0.95$, $p < 0.05$ in 12/14 conditions), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides actionable thresholds (CW $> 10$: use diversity; CW $< 7$: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, achieving state-of-the-art performance.
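
The CDP thresholds stated in the abstract can be read as a simple decision rule over the average content-word count of a task's queries. The following is a minimal sketch, not the authors' implementation: the stopword set `FUNCTION_WORDS`, the regex tokenizer, and the function names are all hypothetical stand-ins, since the abstract does not say how content words are identified. The abstract also gives no guidance for CW between 7 and 10, so the sketch reports that range as unspecified.

```python
import re

# Hypothetical function-word list (an assumption): the source does not
# define how content words are identified.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "been", "do",
    "does", "did", "what", "which", "who", "whom", "when", "where", "why",
    "how", "that", "this", "it", "its", "as", "from",
}


def content_word_count(query: str) -> int:
    """Count tokens that are not in the (assumed) function-word list."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for token in tokens if token not in FUNCTION_WORDS)


def mean_cw(queries: list[str]) -> float:
    """Average content-word count over a sample of queries."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(content_word_count(q) for q in queries) / len(queries)


def cdp_recommendation(avg_cw: float) -> str:
    """Apply the thresholds quoted in the abstract (CW > 10, CW < 7)."""
    if avg_cw > 10:
        return "use diversity"
    if avg_cw < 7:
        return "avoid diversity"
    # The abstract states no rule for 7 <= CW <= 10.
    return "unspecified"


if __name__ == "__main__":
    sample = ["Who wrote Hamlet?", "When was the Eiffel Tower built?"]
    print(cdp_recommendation(mean_cw(sample)))
```

In practice one would estimate the average on a sample of the target task's queries before deciding how many queries per document to synthesize.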

Paper Structure (58 sections, 2 equations, 12 figures, 19 tables, 1 algorithm)

Figures (12)

  • Figure 1: Diverse queries act as regularization (Contriever case study). With single-query training (Q/Doc=1), the retriever overfits to surface features like character names, retrieving passages that merely mention "Ivan." With diverse training queries (Q/Doc=3), the model learns fine-grained semantic matching and correctly retrieves passages about Ivan's actual promise. Example from NovelHopQA \cite{novelhopqa2025}.
  • Figure 2: Experimental pipeline for testing the Complexity-Diversity Principle. Given a document corpus, we (1) generate $M$ diverse queries per document using zero-shot prompting, (2) measure query quality and diversity using Q-D metrics, and (3) tune diversity level based on target task (dashed arrow shows iteration on sample data). The retriever is then trained with standard contrastive loss. This pipeline enables controlled experiments varying diversity while holding other factors constant.
  • Figure 3: Diverse prompting strategy for multi-hop tasks. The prompt enforces varied query formats (factual, procedural, causal, conditional, keyword, statement, comparison) targeting different information from the document. M denotes Q/Doc.
  • Figure 4: Quality and Diversity metrics as the number of queries per document increases. (a) Quality ($\uparrow$): Dist-Sim and Len-Sim measure similarity to human-annotated queries, where higher values indicate more human-like quality. (b) Diversity ($\downarrow$): CE and Self-BLEU measure query similarity, where lower values indicate higher diversity. Few-shot methods become less diverse (values increase), while our method becomes dramatically more diverse (scores drop from $\sim$0.8 to $\sim$0.1). DRAGON-S is excluded as it uses a teacher model to rerank document order, making quality and diversity metrics not directly comparable. Full data in Appendix Table \ref{tab:qd_full}.
  • Figure 5: Query diversity correlates with multi-hop retrieval performance. X-axis: CE (cross-encoder paraphrase ratio), where lower = more diverse. Y-axis: NDCG@10 on Multi-hop benchmark. Our method with the lowest CE achieves the best performance on both retrievers.
  • ...and 7 more figures
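
The Figure 4 caption reads Self-BLEU as a diversity signal: lower values mean the queries generated for one document overlap less with each other. The sketch below illustrates that idea under simplifying assumptions and is not the paper's metric code: it uses whitespace tokenization, clipped n-gram precision for n = 1 and 2 only, and no brevity penalty or smoothing, so its values will not match a standard BLEU implementation. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp: list[str], refs: list[list[str]], n: int) -> float:
    """Fraction of hypothesis n-grams found in any reference, with counts clipped."""
    hyp_counts = ngrams(hyp, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / total


def simple_self_bleu(queries: list[str], max_n: int = 2) -> float:
    """Simplified Self-BLEU: score each query against all the others, then average."""
    tokenized = [q.lower().split() for q in queries]
    if len(tokenized) < 2:
        raise ValueError("need at least two queries")
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        precisions = [clipped_precision(hyp, refs, n) for n in range(1, max_n + 1)]
        if min(precisions) == 0.0:
            scores.append(0.0)  # no smoothing in this sketch
        else:
            # Geometric mean of the n-gram precisions.
            scores.append(math.exp(sum(math.log(p) for p in precisions) / max_n))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    near_duplicates = ["who wrote hamlet", "who wrote macbeth"]
    varied = ["who wrote hamlet", "steps to stage a tragedy"]
    print(simple_self_bleu(near_duplicates), simple_self_bleu(varied))
```

Under this sketch, a set of near-paraphrases scores higher than a set of queries with different formats and targets, which is the direction of the trend the caption describes.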