Constructing Tree-based Index for Efficient and Effective Dense Retrieval

Haitao Li, Qingyao Ai, Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Zheng Liu, Zhao Cao

TL;DR

Dense retrieval faces indexing bottlenecks that hinder practical deployment. JTR proposes end-to-end joint optimization of a trainable tree-based index and a query encoder using a unified contrastive loss and tree-aware negative sampling, complemented by overlapped clustering to relax mutually exclusive partitions. The approach enforces a maximum heap property to support efficient beam search and demonstrates strong effectiveness with sub-linear retrieval time on MS MARCO against strong ANN baselines. These results indicate a practical path to balancing retrieval quality and latency in neural first-stage retrieval systems, with potential extensions to memory-aware indexing such as joint PQ/tree optimization.

Abstract

Recent studies have shown that Dense Retrieval (DR) techniques can significantly improve the performance of first-stage retrieval in IR systems. Despite its empirical effectiveness, the application of DR remains limited. In contrast to statistical retrieval models, which rely on highly efficient inverted index solutions, DR models build dense embeddings that are difficult to pre-process with most existing search indexing systems. To avoid the expensive cost of brute-force search, Approximate Nearest Neighbor (ANN) algorithms and their corresponding indexes are widely applied to speed up the inference process of DR models. Unfortunately, while ANN can improve the efficiency of DR models, it usually comes at a significant cost to retrieval performance. To solve this issue, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. Specifically, we design a new unified contrastive learning loss to train the tree-based index and the query encoder in an end-to-end manner. A tree-based negative sampling strategy is applied to give the tree the maximum heap property, which supports the effectiveness of beam search. Moreover, we treat cluster assignment as an optimization problem and use it to update the tree-based index in a way that allows overlapped clustering. We evaluate JTR on numerous popular retrieval benchmarks. Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency compared with widely adopted baselines. It provides a potential solution to balance efficiency and effectiveness in neural retrieval system designs.
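To make the role of beam search concrete, the following is a minimal sketch of beam search over a tree whose nodes carry embeddings. It is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the dot-product scoring, and the toy embeddings are all hypothetical. The sketch only shows why the maximum heap property matters: if a parent scores at least as high as its best child, pruning low-scoring parents does not discard the best leaves.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    """Hypothetical tree node: an embedding plus children (or documents if a leaf)."""
    embedding: Sequence[float]
    children: List["Node"] = field(default_factory=list)
    doc_ids: List[int] = field(default_factory=list)  # only filled for leaves


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def beam_search(root: Node, query: Sequence[float], beam_size: int) -> List[int]:
    """Walk the tree level by level, keeping the beam_size best-scoring nodes."""
    frontier = [root]
    while any(node.children for node in frontier):
        candidates = []
        for node in frontier:
            # Internal nodes are replaced by their children; leaves stay in the race.
            candidates.extend(node.children if node.children else [node])
        candidates.sort(key=lambda n: dot(query, n.embedding), reverse=True)
        frontier = candidates[:beam_size]
    # Collect documents from the surviving leaves. With overlapped clustering a
    # document may sit in several leaves, so duplicates are dropped.
    seen, result = set(), []
    for leaf in frontier:
        for doc_id in leaf.doc_ids:
            if doc_id not in seen:
                seen.add(doc_id)
                result.append(doc_id)
    return result


# Toy tree with 4 leaf clusters under 2 internal nodes (all values are made up).
l0 = Node([1.0, 0.0], doc_ids=[0, 1])
l1 = Node([0.8, 0.2], doc_ids=[1, 2])  # document 1 appears in two leaves
l2 = Node([0.0, 1.0], doc_ids=[3])
l3 = Node([0.1, 0.9], doc_ids=[4])
left = Node([0.9, 0.1], children=[l0, l1])
right = Node([0.05, 0.95], children=[l2, l3])
root = Node([0.0, 0.0], children=[left, right])

candidates_b2 = beam_search(root, [1.0, 0.0], beam_size=2)
candidates_b1 = beam_search(root, [0.0, 1.0], beam_size=1)
```

With a beam of size $b$, each level scores at most $b$ times the branching factor nodes, so the work grows with tree depth rather than with corpus size. In a full system the returned candidates would typically be re-scored against the query before ranking.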

Paper Structure

This paper contains 24 sections, 16 equations, 10 figures, 3 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Illustration of the JTR tree structures. The integer in each node is its sequence number. In this case, the tree has a depth of 3, 4 clusters, branch balance factor $\beta = 2$, and leaf balance factor $\gamma = 4$. The beam size $b$ is set to 2.
  • Figure 2: Initialization of the tree structure. The integer in each node indicates the number of documents the node contains. In this case, the total number of documents is 10, the branch balance factor $\beta = 2$, and the leaf balance factor $\gamma = 5$. If a node contains more than $\gamma$ documents, k-means is performed on the document embeddings in that node, and this is repeated until every node contains no more than $\gamma$ documents. The embedding of each node is initialized as its cluster centroid embedding.
  • Figure 3: Comparison of the workflow of JTR and existing methods. The solid arrows indicate that the gradient propagates backward, while the dashed arrows indicate that the gradient does not propagate.
  • Figure 4: The sampling process of JTR. The integer in each node is its sequence number. We select the sibling nodes of positive samples as negative samples.
  • Figure 5: Illustration of the optimized overlapped cluster. In this case, there are 3 queries, 4 leaf nodes, and 5 documents. $q_i$, $l_i$, and $d_i$ denote the $i$-th query, leaf node, and document, respectively. We set the number of overlapped clusters to $\lambda = 2$. The values in the red boxes are identified by the $\textit{Proj}(\cdot)$ function. In practice, if a document has the same value for two leaves in $C^*$, the $\textit{Proj}(\cdot)$ function prefers to keep the document in its original leaf.
  • ...and 5 more figures
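
The initialization described in the Figure 2 caption can be sketched as recursive k-means. This is a rough illustration under stated assumptions rather than the paper's code: the dictionary-based node layout, the helper names, the plain Lloyd's k-means, the fallback for degenerate splits, and the random toy embeddings are all hypothetical. It also assumes that splitting stops once a node holds at most $\gamma$ documents.

```python
import random
from typing import Dict, Iterator, List, Sequence

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(ids: List[int], emb: Dict[int, Vector], k: int,
           rng: random.Random, iters: int = 10) -> List[List[int]]:
    """Plain Lloyd's k-means over the documents in ids; returns non-empty groups."""
    centroids = [list(emb[i]) for i in rng.sample(ids, k)]
    groups: List[List[int]] = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i in ids:
            nearest = min(range(k), key=lambda c: sq_dist(emb[i], centroids[c]))
            groups[nearest].append(i)
        # Recompute centroids; an empty group keeps its previous centroid.
        centroids = [mean([emb[i] for i in g]) if g else centroids[c]
                     for c, g in enumerate(groups)]
    return [g for g in groups if g]


def build_tree(ids: List[int], emb: Dict[int, Vector], beta: int,
               gamma: int, rng: random.Random) -> dict:
    """Split a node into up to beta children until it holds at most gamma documents."""
    # Each node embedding starts as the centroid of the documents below it.
    node = {"embedding": mean([emb[i] for i in ids]), "children": [], "doc_ids": []}
    if len(ids) <= gamma:
        node["doc_ids"] = list(ids)
        return node
    groups = kmeans(ids, emb, min(beta, len(ids)), rng)
    if len(groups) < 2:
        # Assumed fallback: identical embeddings cannot be split, so halve the list.
        half = len(ids) // 2
        groups = [ids[:half], ids[half:]]
    node["children"] = [build_tree(g, emb, beta, gamma, rng) for g in groups]
    return node


def leaves(node: dict) -> Iterator[dict]:
    if not node["children"]:
        yield node
    for child in node["children"]:
        yield from leaves(child)


# Same sizes as the Figure 2 example: 10 documents, beta = 2, gamma = 5.
rng = random.Random(0)
emb = {i: [rng.random(), rng.random()] for i in range(10)}
tree = build_tree(list(range(10)), emb, beta=2, gamma=5, rng=rng)
leaf_sizes = [len(leaf["doc_ids"]) for leaf in leaves(tree)]
```

At this stage every document belongs to exactly one leaf. The overlapped clustering shown in Figure 5 is a later update to the index and is not part of this sketch.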