On the Theoretical Limitations of Embedding-Based Retrieval

Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee

TL;DR

The paper investigates fundamental limits of single-vector embeddings for retrieval by linking embedding dimension to the combinatorial space of top-k results via sign-rank theory. It formalizes the capacity of dense vector representations to preserve relevance patterns and demonstrates, both theoretically and via best-case optimization, that many top-k combinations cannot be represented within practical embedding dimensions. It then introduces the LIMIT dataset as a simple, natural-language stress test showing that state-of-the-art embedding models fail to solve even easy-looking tasks when the top-k combination space is dense. The findings argue for developing more expressive retrievers (e.g., cross-encoders, multi-vector models, or sparse approaches) and for evaluating on datasets that probe broader combinatorial retrieval settings.

Abstract

Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
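
The link between embedding dimension and the set of returnable top-k subsets can be illustrated with a toy example. The sketch below is not the paper's experiment: the document vectors, the query lists, and the helper names (`top_k_set`, `achievable_top_k`) are all hypothetical and chosen for illustration. It assumes dot-product scoring and shows that three documents embedded in one dimension can return only two of the three possible top-2 pairs, while two dimensions suffice for all three.

```python
import math
from itertools import combinations


def top_k_set(query, docs, k):
    """Return the indices of the k documents with the highest dot-product score."""
    scores = [sum(q * x for q, x in zip(query, doc)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])


def achievable_top_k(queries, docs, k):
    """Collect every distinct top-k set produced by the given queries."""
    return {top_k_set(q, docs, k) for q in queries}


all_pairs = {frozenset(p) for p in combinations(range(3), 2)}

# One dimension: documents are scalars, so only the sign of a nonzero query
# matters, and the top-2 is either the two largest or the two smallest values.
docs_1d = [(-1.0,), (0.0,), (2.0,)]      # hypothetical document embeddings
queries_1d = [(1.0,), (-1.0,)]           # one query per sign
reached_1d = achievable_top_k(queries_1d, docs_1d, k=2)
missing_1d = all_pairs - reached_1d      # the pair {0, 2} is never returned

# Two dimensions: place the documents at 0, 120, and 240 degrees on the unit
# circle. A query pointing away from document i ranks the other two on top.
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
docs_2d = [(math.cos(a), math.sin(a)) for a in angles]
queries_2d = [(-x, -y) for x, y in docs_2d]
reached_2d = achievable_top_k(queries_2d, docs_2d, k=2)

print("1-D reachable pairs:", sorted(sorted(s) for s in reached_1d))
print("1-D missing pairs:  ", sorted(sorted(s) for s in missing_1d))
print("2-D reaches all pairs:", reached_2d == all_pairs)
```

The one-dimensional case is exhaustive for nonzero queries because scaling a query by a positive constant does not change the ranking. The two-dimensional case only needs to exhibit one query per pair.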

Paper Structure

This paper contains 40 sections, 2 theorems, 8 equations, 7 figures, 7 tables.

Key Result

proposition 1

For a binary matrix $A \in \{0,1\}^{m \times n}$, the row-wise order-preserving rank equals the row-wise thresholdable rank: $\operatorname{rank}_{\text{rop}} A = \operatorname{rank}_{\text{rt}} A$.
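
The intuition behind this equality can be checked on a small example. The sketch below assumes the following reading of the two notions, since definitions 1 and 2 are not reproduced on this page: a score matrix $B$ is row-wise order preserving for $A$ if, within each row, every entry where $A$ is 1 scores strictly higher than every entry where $A$ is 0, and it is row-wise thresholdable if each row admits a threshold that separates those two groups. The matrices and function names are hypothetical. For binary $A$ the two conditions accept exactly the same score matrices, which is why the minimal ranks coincide.

```python
def is_row_order_preserving(A, B):
    """Check that B[i][j] > B[i][k] whenever A[i][j] > A[i][k], for every row i."""
    for a_row, b_row in zip(A, B):
        for j, a_j in enumerate(a_row):
            for k, a_k in enumerate(a_row):
                if a_j > a_k and not b_row[j] > b_row[k]:
                    return False
    return True


def row_thresholds(A, B):
    """Return one separating threshold per row, or None if some row has none."""
    taus = []
    for a_row, b_row in zip(A, B):
        relevant = [b for a, b in zip(a_row, b_row) if a == 1]
        irrelevant = [b for a, b in zip(a_row, b_row) if a == 0]
        if relevant and irrelevant:
            low, high = max(irrelevant), min(relevant)
            if not low < high:
                return None          # groups overlap, no threshold exists
            taus.append((low + high) / 2)
        elif relevant:
            taus.append(min(relevant) - 1.0)   # any value below all scores
        elif irrelevant:
            taus.append(max(irrelevant) + 1.0)  # any value above all scores
        else:
            taus.append(0.0)                    # empty row
    return taus


# Hypothetical relevance matrix: two queries (rows), three documents (columns).
A = [[1, 1, 0],
     [0, 1, 1]]

# Scores that respect the relevance pattern in every row.
B_good = [[0.9, 0.7, 0.1],
          [-0.3, 0.4, 0.8]]

# Row 0 ranks an irrelevant document (0.5) above a relevant one (0.2).
B_bad = [[0.9, 0.2, 0.5],
         [-0.3, 0.4, 0.8]]

for name, B in [("B_good", B_good), ("B_bad", B_bad)]:
    ordered = is_row_order_preserving(A, B)
    thresholdable = row_thresholds(A, B) is not None
    print(name, "order preserving:", ordered, "thresholdable:", thresholdable)
```

Both checks agree on each matrix: in a binary row, "every relevant score beats every irrelevant score" is the same statement as "some threshold lies strictly between the two groups".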

Figures (7)

  • Figure 1: A depiction of the LIMIT dataset creation process, based on theoretical limitations. We test all combinations of relevance for $N$ documents (e.g., in the figure, all combinations of relevance for three documents with two relevant documents per query) and instantiate them using a simple mapping. Despite this simplicity, SOTA MTEB models perform poorly, scoring less than 20 recall@100.
  • Figure 2: The critical-n value where the dimensionality is too small to successfully represent all the top-2 combinations. We plot the trend line as a polynomial function.
  • Figure 3: Scores on the LIMIT task. Despite the simplicity of the task, we see that SOTA models struggle. We also see that the dimensionality of the model is a limiting factor and that as the dimension increases, so does performance. Even multi-vector models struggle. Lexical models like BM25 do very well due to their higher dimensionality. Stars indicate models trained with MRL.
  • Figure 4: Scores on the LIMIT small task (N=46) over embedding dimensions. Despite having just 46 documents, models struggle even with recall@10 and cannot solve the task even with recall@20.
  • Figure 5: Training on LIMIT train does not significantly help, indicating the issue is not domain shift. But models can solve it if they overfit to the test set.
  • ...and 2 more figures
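
The relevance pattern described in the Figure 1 caption (every combination of two relevant documents out of $N$) and the recall@k metric used in the captions can be sketched as follows. This is a minimal illustration, not the authors' dataset-construction code: the function names are hypothetical, documents are represented by integer ids, and the natural-language mapping mentioned in the caption is omitted.

```python
from itertools import combinations
from math import comb


def build_qrels(n_docs, k=2):
    """One query per k-subset of documents; the subset is that query's relevant set."""
    return [frozenset(c) for c in combinations(range(n_docs), k)]


def qrels_to_matrix(qrels, n_docs):
    """Binary relevance matrix: rows are queries, columns are documents."""
    return [[1 if j in relevant else 0 for j in range(n_docs)] for relevant in qrels]


def recall_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the relevant documents that appear in the top k of a ranking."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


# The three-document case from the Figure 1 caption: three queries, each with
# two relevant documents.
toy_qrels = build_qrels(3, k=2)
toy_matrix = qrels_to_matrix(toy_qrels, 3)
for row in toy_matrix:
    print(row)

# Number of distinct top-2 combinations for the N=46 setting named in Figure 4.
n_pairs_46 = comb(46, 2)
print("top-2 combinations for N=46:", n_pairs_46)

# A hypothetical ranking for the first query, whose relevant set is {0, 1}.
example_recall = recall_at_k([2, 0, 1], toy_qrels[0], k=2)
print("recall@2:", example_recall)
```

Because every pair of documents is the answer to some query, a retriever can only score perfectly if its embedding space can realize every top-2 combination at once, which is the dense setting the captions describe.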

Theorems & Definitions (8)

  • definition 1
  • definition 2
  • remark 1
  • proposition 1
  • proof
  • definition 3: Sign Rank
  • proposition 2
  • proof