Retrieval with Learned Similarities

Bailu Ding, Jiaqi Zhai

TL;DR

This paper tackles retrieval with expressive learned similarities by introducing Mixture-of-Logits (MoL), which represents the similarity between a query and an item as a gated mixture of low-rank embeddings: $\phi(q,x)=\sum_{p=1}^P \pi_p(q,x)\langle f_p(q), g_p(x)\rangle$. It proves MoL is a universal approximator for high-rank similarity matrices and proposes a mutual-information-based load-balancing loss to regularize conditional computation. The authors present exact and approximate top-$k$ retrieval algorithms with tight error bounds, and demonstrate state-of-the-art results across heterogeneous tasks such as sequential recommendation and QA finetuning, achieving up to 66x latency reductions while maintaining recall above 0.99 relative to exact methods. Empirical results show MoL outperforms dense and generative baselines on both recommendation and QA workloads, validating the practicality and scalability of learned similarities and supporting a shift from MIPS-based retrieval to Retrieval with Learned Similarities (RAILS) on GPUs.
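To make the formula concrete, here is a minimal sketch of how $\phi(q,x)$ could be evaluated for one query-item pair, followed by a brute-force top-$k$ scan. It is an illustration under assumptions, not the authors' implementation: the gating weights $\pi_p(q,x)$ are assumed to come from a softmax over per-component gating scores, and the names `mol_similarity`, `top_k_exact`, and `gate_fn` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the output is non-negative and sums to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mol_similarity(
    query_embs: Sequence[Vector],   # f_1(q), ..., f_P(q)
    item_embs: Sequence[Vector],    # g_1(x), ..., g_P(x)
    gate_scores: Sequence[float],   # unnormalized scores, one per component
) -> float:
    """phi(q, x) = sum_p pi_p(q, x) * <f_p(q), g_p(x)>.

    Assumption: pi is a softmax over gate_scores (not stated in the text above).
    """
    logits = [dot(f, g) for f, g in zip(query_embs, item_embs)]
    weights = softmax(gate_scores)
    return sum(w * l for w, l in zip(weights, logits))


def top_k_exact(
    query_embs: Sequence[Vector],
    corpus: Sequence[Sequence[Vector]],
    gate_fn: Callable[[Sequence[Vector], Sequence[Vector]], Sequence[float]],
    k: int,
) -> List[int]:
    """Brute-force top-k: score every item with MoL, return the best k indices."""
    scored = [
        (mol_similarity(query_embs, item, gate_fn(query_embs, item)), idx)
        for idx, item in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy usage with P = 2 components and a hypothetical uniform gate.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
uniform_gate = lambda q, x: [0.0, 0.0]
best = top_k_exact(query, corpus, uniform_gate, k=2)
```

The brute-force scan costs one MoL evaluation per item, which is the baseline that approximate top-$k$ retrieval aims to avoid on large corpora.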

Abstract

Retrieval plays a fundamental role in recommendation systems, search, and natural language processing (NLP) by efficiently finding relevant items from a large corpus given a query. Dot products have been widely used as the similarity function in such tasks, enabled by Maximum Inner Product Search (MIPS) algorithms for efficient retrieval. However, state-of-the-art retrieval algorithms have migrated to learned similarities. These advanced approaches encompass multiple query embeddings, complex neural networks, direct item ID decoding via beam search, and hybrid solutions. Unfortunately, we lack efficient solutions for retrieval in these state-of-the-art setups. Our work addresses this gap by investigating efficient retrieval techniques with expressive learned similarity functions. We establish Mixture-of-Logits (MoL) as a universal approximator of similarity functions, demonstrate that MoL's expressiveness can be realized empirically to achieve superior performance on diverse retrieval scenarios, and propose techniques to retrieve the approximate top-$k$ results using MoL with tight error bounds. Through extensive experimentation, we show that MoL, enhanced by our proposed mutual information-based load balancing loss, sets new state-of-the-art results across heterogeneous scenarios, including sequential retrieval models in recommendation systems and finetuning language models for question answering; and our approximate top-$k$ algorithms outperform baselines by up to 66x in latency while achieving a recall rate above 0.99 compared to exact algorithms.

Paper Structure

This paper contains 43 sections, 5 theorems, 5 equations, 3 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

MoL decomposition: Let $A$ be an $n\times m$ matrix, where $n\leq m$. There exist $\pi_1, B_1, \pi_2, B_2, \cdots, \pi_P, B_P$ such that $|A-\sum_{p=1}^{P}\pi_p \circ B_p|<\epsilon$, where $\epsilon$ is a small positive number. Here $B_p$ is an $n\times m$ matrix with rank equal to or less than

Figures (3)

  • Figure 1: Mixture-of-logits (MoL) learned similarity.
  • Figure 2: Illustration of how to apply Mixture-of-logits (MoL) learned similarity to various retrieval scenarios, with a language model finetuning use case (characterized by a single homogeneous feature) shown on the left, and a recommendation use case (characterized by a large number of heterogeneous features) shown on the right. More details can be found in Appendix \ref{sec:app-exp-emb-parameterization-qa}.
  • Figure 3: Illustration of how to parameterize the embeddings to adapt Mixture-of-logits (MoL) learned similarity to various retrieval scenarios, with a language model (LM) finetuning use case in question answering (characterized by a single homogeneous feature) shown on the left, and a recommendation systems use case (characterized by a large number of heterogeneous features) shown on the right. For the question answering example on the left, $SP_1, \ldots, SP_N$ represent the original SentencePiece \cite{sentencepiece_emnlp18} tokens that are inputs to the pre-trained language model (LM), e.g., T5 \cite{t5_raffel2023exploringlimitstransferlearning}. $Q_1, Q_2, \ldots, Q_{P_Q}$ and $X_1, X_2, \ldots, X_{P_X}$ represent the special aggregation tokens we add to the LM tokenizer for pooling information across the sequence. The "Parameterized Pooling" component uses a $D$-dimensional embedding as input to parameterize, at an example level, how to weight each of the (max_seq_len) encoder outputs for the $P_Q$/$P_X$ MoL component-level embeddings.

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Definition 1
  • Definition 2
  • Theorem 2
  • Theorem 3