The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Nils Reimers, Iryna Gurevych
TL;DR
The paper analyzes, both theoretically and empirically, how dense low-dimensional retrieval degrades as index size grows. It proves that the probability of false positives increases with index size and decreases with higher dimensionality, then confirms these effects on MS MARCO using a DistilRoBERTa bi-encoder. Results show that dense methods outperform BM25 at small to mid-sized indexes but can be outperformed at very large scales, and random-noise experiments illustrate pronounced false positives in low-dimensional dense spaces. The work cautions against extrapolating small-index success to large-scale deployments and highlights the dimensionality trade-off as a key design consideration.
Abstract
Information retrieval using dense low-dimensional representations has recently become popular and has been shown to outperform traditional sparse representations such as BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as the index size increases. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: the lower the dimension, the higher the chance of false positives, i.e., returning irrelevant documents.
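The link between dimensionality, index size, and false positives can be illustrated with a small Monte Carlo simulation. The sketch below is not the paper's proof or its experimental setup. It assumes that irrelevant documents are random directions drawn uniformly from the unit sphere and that the relevant document has a fixed cosine similarity to the query. The function name `false_positive_rate` and the parameters `relevant_sim`, `trials`, and `seed` are hypothetical and chosen only for this illustration.

```python
import math
import random


def false_positive_rate(dim, index_size, relevant_sim=0.8, trials=50, seed=0):
    """Estimate how often a random document outscores the relevant one.

    Assumption (illustrative only): irrelevant documents are uniformly random
    unit vectors, and the relevant document has cosine similarity
    `relevant_sim` with the query. By rotational symmetry the query can be
    fixed to the first coordinate axis, so the cosine similarity of a random
    document with the query is its first coordinate divided by its norm.
    """
    hits = 0
    for trial in range(trials):
        # A per-trial seed keeps the first n documents of a trial identical
        # for every index size, so the estimate cannot decrease as the
        # index grows.
        rng = random.Random(seed * 1_000_003 + trial)
        for _ in range(index_size):
            # Normalized Gaussian samples are uniform on the unit sphere.
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(v * v for v in vec))
            if vec[0] / norm > relevant_sim:
                hits += 1  # at least one false positive in this trial
                break
    return hits / trials


if __name__ == "__main__":
    for dim in (4, 16, 64):
        for index_size in (10, 100, 1000):
            rate = false_positive_rate(dim, index_size)
            print(f"dim={dim:3d} index_size={index_size:5d} rate={rate:.2f}")
```

Under these assumptions, the estimated rate grows with `index_size` for a fixed `dim` and shrinks as `dim` increases for a fixed `index_size`. Real document embeddings are not uniformly distributed, so the numbers indicate only the direction of the effect, not its magnitude.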
