When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, Luca Soldaini
TL;DR
This work presents the first large-scale analysis of LM-based query and document expansion across multiple expansion methods, retrievers, and distribution shifts. It reveals a robust negative correlation: as base retriever performance improves, the gains from expansion shrink, and expansion often becomes detrimental. Long-query shifts are a notable exception, where expansions help. The pattern holds across architectures and datasets, and a qualitative error analysis attributes the failures to noise introduced by the expansions. The findings provide practical guidance: prefer LM expansions for weaker models or when distribution shifts are pronounced, and avoid them for strong, well-tuned rankers, where they dilute the relevance signal. Limitations include zero-shot evaluation, reliance on commercial LMs, English-only datasets, and substantial compute requirements for replication.
Abstract
Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they also add noise that makes it difficult to distinguish among the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset differs significantly from the training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.
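As a rough illustration of the kind of relationship described above, the sketch below computes the gain from expansion for each retriever (expanded score minus base score) and the Pearson correlation between base score and gain. It is a minimal sketch, not the authors' evaluation code: the helper names `expansion_gains` and `pearson` are hypothetical, and the scores are made-up placeholders for some retrieval metric, not results from the paper.

```python
from math import sqrt


def expansion_gains(base_scores, expanded_scores):
    """Per-retriever gain: score with expansion minus score without it."""
    if len(base_scores) != len(expanded_scores):
        raise ValueError("score lists must have the same length")
    return [e - b for b, e in zip(base_scores, expanded_scores)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder scores for four hypothetical retrievers, ordered weak to strong.
# These numbers are invented for illustration and are NOT from the paper.
base = [0.20, 0.35, 0.50, 0.65]
expanded = [0.30, 0.40, 0.49, 0.60]

gains = expansion_gains(base, expanded)
r = pearson(base, gains)
print(f"gains per retriever: {gains}")
print(f"correlation between base score and gain: {r:.3f}")
```

With placeholder numbers shaped like this, the weakest retriever gains from expansion, the strongest loses, and the correlation is negative. Real scores would come from running each retriever with and without expansion on the same dataset.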
