The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

Nils Reimers, Iryna Gurevych

TL;DR

The paper analyzes, both theoretically and empirically, how dense low-dimensional retrieval degrades as index size grows. It proves that the probability of false positives increases with index size and decreases with higher dimensionality, then confirms these effects on MS MARCO using a DistilRoBERTa bi-encoder. Dense methods outperform BM25 at small to mid-sized indexes but can be outperformed at very large scales, and experiments with random noise illustrate pronounced false positives in low-dimensional dense spaces. The work cautions against extrapolating small-index success to large-scale deployments and highlights the dimensionality trade-off as a key design consideration.

Abstract

Information retrieval using dense low-dimensional representations has recently become popular and has been shown to outperform traditional sparse representations like BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as index size increases. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: the lower the dimension, the higher the chance of false positives, i.e., returning irrelevant documents.
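
The relationship between dimensionality, index size, and false positives can be illustrated with a toy Monte Carlo simulation. The sketch below is not the paper's proof or experimental setup. It assumes that irrelevant documents are uniformly random unit vectors, that the relevant document has a fixed cosine similarity to the query (the threshold of 0.6 is an arbitrary choice), and it counts how often at least one random document scores higher. The function name and all parameters are hypothetical.

```python
import math
import random


def false_positive_rate(dim, index_size, threshold=0.6, trials=30, seed=0):
    """Estimate how often a random index contains a false positive.

    A false positive here means: at least one of `index_size` random
    documents has a higher cosine similarity to the query than the
    relevant document, whose similarity is assumed to be `threshold`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for _ in range(index_size):
            # A normalized Gaussian vector is uniform on the unit sphere.
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(x * x for x in vec))
            # By rotational symmetry the query can be fixed to the first
            # axis, so the cosine similarity is the first coordinate.
            if vec[0] / norm > threshold:
                hits += 1
                break  # one false positive is enough for this trial
    return hits / trials


if __name__ == "__main__":
    for dim in (4, 16, 64):
        for index_size in (10, 100, 1000):
            rate = false_positive_rate(dim, index_size)
            print(f"dim={dim:3d} index_size={index_size:5d} rate={rate:.2f}")
```

Under these assumptions, the estimated rate should grow with `index_size` and shrink with `dim`, which mirrors the qualitative claim of the abstract. Real embeddings are not uniformly distributed, so the absolute numbers carry no meaning for an actual retrieval system.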

Paper Structure

This paper contains 11 sections, 7 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Plot of queries (blue), the relevant document (green), and representations from randomly generated strings (red). Dimensionality reduction via UMAP (McInnes et al., 2018). Model with hard negatives, 768 dimensions.