ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval

Suyuan Huang, Chao Zhang, Yuanyuan Wu, Haoxin Zhang, Yuan Wang, Maolin Wang, Shaosheng Cao, Tong Xu, Xiangyu Zhao, Zengchang Qin, Yan Gao, Yunhan Bai, Jun Fan, Yao Hu, Enhong Chen

TL;DR

This work proposes ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency, and verifies the scaling law of dense retrieval with LLMs in industrial scenarios, enabling cost-effective scaling of dense retrieval systems.

Abstract

Dense retrieval in most industries employs dual-tower architectures to retrieve query-relevant documents. Due to online deployment requirements, existing real-world dense retrieval systems mainly enhance performance by designing negative sampling strategies, overlooking the advantages of scaling up. Recently, Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval. However, scaling up retrieval models significantly increases online query latency. To address this challenge, we propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency. The first stage is training dual towers, both initialized from the same LLM, to unlock the potential of LLMs for dense retrieval. Then, we distill only the query tower using mean squared error loss and cosine similarity to reduce online costs. Through theoretical analysis and comprehensive offline and online experiments, we show the effectiveness and efficiency of ScalingNote. Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios, enabling cost-effective scaling of dense retrieval systems. Our online method incorporating ScalingNote significantly enhances the relevance between retrieved documents and queries.
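The second stage described above distills only the query tower, using mean squared error and cosine similarity. The following is a minimal sketch of one way such a combined objective could look for a single query embedding. It is not taken from the paper: the function names, the use of `1 - cosine` as the similarity term, and the weights `mse_weight` and `cos_weight` are all assumptions made for illustration.

```python
import math

def mse(student, teacher):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def cosine_similarity(a, b):
    """Cosine similarity of two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def query_distillation_loss(student, teacher, mse_weight=1.0, cos_weight=1.0):
    """Hypothetical combined loss: weighted MSE plus (1 - cosine similarity).

    `student` is the embedding from the small online query tower and
    `teacher` is the embedding from the LLM-based query tower. The weights
    are illustrative assumptions, not values from the paper.
    """
    magnitude_term = mse(student, teacher)
    direction_term = 1.0 - cosine_similarity(student, teacher)
    return mse_weight * magnitude_term + cos_weight * direction_term

# Identical embeddings give (approximately) zero loss;
# orthogonal unit vectors are penalized by both terms.
same = query_distillation_loss([3.0, 4.0], [3.0, 4.0])
orthogonal = query_distillation_loss([1.0, 0.0], [0.0, 1.0])
```

Because only the query tower is distilled in this setup, the document embeddings produced by the LLM-based document tower can remain indexed offline while the smaller student serves queries online.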

Paper Structure

This paper contains 28 sections, 3 theorems, 32 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $y\in \{0, 1\}$ be the label indicating the relevance of a query-document pair, where $1$ denotes relevance and $0$ denotes irrelevance. Assume $\mathbb{S}_n=\{(q_i,d_i,y_i)\}_{i=1}^{n}$ is the training dataset. Let $\ell$ be the overall loss, which is $L_{\ell}$-Lipschitz in its non-target variable. [...] $N(u, \cdot)$ is the $u$-covering number of a function class.

Figures (5)

  • Figure 1: Performance and queries per second (QPS) comparison of scaling strategies for dual-tower dense retrieval. Circle size indicates the number of parameters in the query tower. ScalingNote achieves high AUC without reducing QPS.
  • Figure 2: The framework of ScalingNote. The first stage fully scales up both towers on scaled-up training data, learning through cross-device contrastive learning and hard negative mining. The second stage is query-based knowledge distillation (QKD), which transfers the knowledge of the scaled LLM-based query tower to the faster online query tower.
  • Figure 3: The online framework includes two stages: offline document index construction and online query retrieval.
  • Figure 4: Key information about data collection includes: (a) two sources of query-document pairs: user click behaviors and ranking model evaluations. (b) query association construction method. (c) an example of a multi-hop data sample.
  • Figure 5: The scaling laws of the LLM-based dual-tower architecture for real-world dense retrieval on Xiaohongshu. The dots are experimental results and the dashed lines are the fitted scaling-law curves. The y-axis is the contrastive entropy on the Small validation dataset. (a) The scaling law for model size. (b) The scaling law for data size. (c) The mixed scaling law of model size and data size.
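
The Figure 2 caption refers to contrastive learning for the dual towers. As a rough illustration of the general idea, the sketch below computes a standard in-batch contrastive (InfoNCE-style) loss, where each query's paired document is the positive and the other documents in the batch serve as negatives. This is a generic textbook formulation, not the paper's exact objective: the dot-product similarity, the `temperature` parameter, and the function names are assumptions, and cross-device negative sharing and hard negative mining are omitted.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries, docs, temperature=1.0):
    """Average cross-entropy of picking docs[i] for queries[i] within the batch.

    `queries` and `docs` are equal-length lists of embedding vectors;
    docs[i] is the positive for queries[i], all other docs are negatives.
    `temperature` is a hypothetical scaling parameter.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

# Two queries, each aligned with its own document.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
# Two identical documents: the positive is indistinguishable from the negative.
ambiguous = in_batch_contrastive_loss([[1.0, 0.0], [1.0, 0.0]],
                                      [[1.0, 0.0], [1.0, 0.0]])
```

When the positive cannot be distinguished from the negatives, the loss equals the logarithm of the batch size; it decreases as the query embedding scores its own document higher than the rest.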

Theorems & Definitions (5)

  • Proposition 3.1
  • Proposition 3.2
  • Theorem 1
  • Proof of Proposition 3.1
  • Proof of Proposition 3.2