When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, Luca Soldaini

TL;DR

This work presents the first large-scale analysis of LM-based query and document expansion across multiple expansion methods, retrievers, and distribution shifts. It reveals a robust negative correlation: as base retriever performance improves, gains from expansions decline and often become detrimental, with long-query shifts as a notable exception where expansions help. The study shows this pattern holds across architectures and datasets, supported by qualitative error analysis that attributes failures to noise introduced by expansions. The findings provide practical guidance: prefer LM expansions for weaker models or when distribution shifts are pronounced, and avoid expansions for strong, well-tuned rankers due to signal dilution and noise. Limitations include zero-shot evaluation, reliance on commercial LMs, English-only datasets, and substantial compute requirements for replication.

Abstract

Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they also add noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset differs significantly from the training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.
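
To make the central claim concrete, the following is a minimal sketch of how a correlation between base retriever score and gain from expansion could be computed. It is not the authors' analysis code: the function names `expansion_gains` and `pearson` are hypothetical, the evaluation metric is left unspecified, and the three score pairs are placeholder values chosen for illustration only, not results from the paper.

```python
from math import sqrt
from statistics import mean


def expansion_gains(base_scores, expanded_scores):
    """Per-model gain: score with expansion minus score without it."""
    if len(base_scores) != len(expanded_scores):
        raise ValueError("score lists must have the same length")
    return [e - b for b, e in zip(base_scores, expanded_scores)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder scores for three hypothetical retrievers (NOT from the paper).
base = [20.0, 40.0, 60.0]        # score without expansion
expanded = [30.0, 42.0, 54.0]    # score with expansion

gains = expansion_gains(base, expanded)
r = pearson(base, gains)
print(gains, r)
```

In this toy setup the weakest retriever gains the most and the strongest one loses, so the correlation is negative, which is the shape of the trend the abstract describes.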

Paper Structure (54 sections, 6 figures, 10 tables)

This paper contains 54 sections, 6 figures, 10 tables.

Figures (6)

  • Figure 1: LM-based query and document expansion methods typically improve performance when used with weaker models, but not for stronger models. More accurate models generally lose relevance signal when expansions are provided. Each point is a value in \ref{tab:models}.
  • Figure 2: Effect of expansion over twelve datasets. For each dataset, markers show base performance for models, while the boxplot indicates the range of changes in scores for document and/or query expansion. Across all datasets and models, we note a consistent trend: models with lower base performance benefit from expansion; higher performing rankers generally suffer when expansion techniques are used.
  • Figure 3: An example of expansions obscuring the relevance signal. The non-relevant document in red (×) was ranked higher than the relevant blue (✓) document due to the phrase "Home Equity Line of Credit" being added to the query. The left side shows the original query and documents while the right side shows the ranking.
  • Figure 4: The change in rank for relevant documents in the top 10 when using expansions. Negative values indicate lower ranks (e.g., -5 indicates that the rank of the relevant document went down 5 when using expansions). We see that expansions cause relevant documents to be ranked lower. \ref{fig:position_change_app} in the Appendix shows other datasets with similar results.
  • Figure 5: Model scale does not explain the negative effect of LM-based expansions. While larger MonoT5 models perform worse, all E5 model sizes are equally impacted.
  • ...and 1 more figure
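
The rank-change quantity described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function `rank_changes` and its parameters are hypothetical, "top 10" is assumed to refer to the ranking without expansion, and relevant documents missing from the expanded ranking are simply skipped.

```python
def rank_changes(relevant_ids, base_ranking, expanded_ranking, top_k=10):
    """Return {doc_id: change in rank} for relevant documents.

    Rankings are lists of document ids, best first. A negative value means
    the document was ranked lower (worse) once expansions were used,
    matching the sign convention in the Figure 4 caption.
    """
    base_rank = {doc: i + 1 for i, doc in enumerate(base_ranking)}
    expanded_rank = {doc: i + 1 for i, doc in enumerate(expanded_ranking)}

    changes = {}
    for doc in relevant_ids:
        before = base_rank.get(doc)
        after = expanded_rank.get(doc)
        # Assumption: only track documents in the base top-k that are
        # still present somewhere in the expanded ranking.
        if before is None or before > top_k or after is None:
            continue
        changes[doc] = before - after
    return changes


# Toy rankings (hypothetical document ids, not data from the paper).
base = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
expanded = ["d1", "d3", "d4", "d5", "d6", "d7", "d2", "d8"]

changes = rank_changes(["d2"], base, expanded)
print(changes)
```

Here the relevant document `d2` moves from position 2 to position 7, so its change is -5, the same case the caption uses as its example.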