Semi-Parametric Retrieval via Binary Bag-of-Tokens Index

Jiawei Zhou, Li Dong, Furu Wei, Lei Chen

TL;DR

SiDR introduces a semi-parametric, disentangled bi-encoder retrieval framework that decouples the index from neural parameters, enabling both embedding-based and tokenization-based indexing. By aligning a parametric representation $V_{\theta}(x)$ with a non-parametric bag-of-tokens index $V_{\text{BoT}}(x)$ through semi-parametric contrastive learning and MLM-inspired alignment, SiDR achieves competitive or superior effectiveness while dramatically reducing indexing costs, exemplified by a drop from 31GB/30 GPU hours to 2GB/1 CPU hour in a Wikipedia-scale corpus. The work proposes flexible search pipelines (SiDR_full, SiDR_beta, and late-parametric reranking) and supports in-training retrieval without index rebuilding, addressing real-time, cost-sensitive, and co-training use cases. Across Wiki21m and BEIR benchmarks, SiDR demonstrates notable gains over BM25 and other baselines, particularly when leveraging the non-parametric index for cost-efficient retrieval, and shows favorable latency characteristics. The results highlight the practical value of semi-parametric, disentangled retrieval for scalable, parameter-agnostic, and data-efficient IR in RAG and exploration scenarios.
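
To make the idea of a binary bag-of-tokens index concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy whitespace tokenizer in place of a subword tokenizer, and a hand-written dictionary `query_weights` standing in for a learned sparse query representation; `tokenize`, `build_bot_index`, and `search` are hypothetical helper names. The index is built by tokenization alone, so no model is needed at indexing time.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: toy whitespace tokenizer; a real system would
    # likely use the retriever's own subword tokenizer.
    return text.lower().split()

def build_bot_index(docs):
    # Inverted index: token -> set of doc ids. This is equivalent to
    # storing one binary vocabulary-sized vector per document.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index

def search(index, query_weights, k=3):
    # Inner product between a weighted sparse query and binary
    # document vectors: sum the query weights of matching tokens.
    scores = defaultdict(float)
    for tok, weight in query_weights.items():
        for doc_id in index.get(tok, ()):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

docs = [
    "the eiffel tower is in paris",
    "berlin is the capital of germany",
    "paris is the capital of france",
]
index = build_bot_index(docs)

# Hypothetical query weights (hand-picked, not model output).
query_weights = {"capital": 1.5, "france": 2.0, "paris": 1.0}
results = search(index, query_weights)
```

Because documents are represented only by which tokens they contain, the index does not change when model parameters change, which is what allows retrieval during training without rebuilding it.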

Abstract

Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) When using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval, while consistently outperforming BM25 on all in-domain datasets; (iii) Additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
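
One way to read the late parametric mechanism in (iii) is as a two-stage search: first retrieve candidates from the tokenization-based index, then encode only those candidates on the fly and rerank them. The sketch below illustrates that reading under stated assumptions and is not the authors' code. `toy_encoder` is a hypothetical term-frequency stand-in for a neural encoder, and the cutoffs `m` and `k` are illustrative parameters.

```python
def tokenize(text):
    # Assumption: toy whitespace tokenizer.
    return text.lower().split()

def toy_encoder(text):
    # Hypothetical stand-in for a parametric encoder: term-frequency
    # weights. A real encoder would produce learned token weights.
    weights = {}
    for tok in tokenize(text):
        weights[tok] = weights.get(tok, 0.0) + 1.0
    return weights

def dot(a, b):
    # Sparse inner product between two token -> weight dicts.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def late_parametric_search(query, docs, bot_index, m=3, k=2):
    q = toy_encoder(query)
    # Stage 1: score the query against binary token sets (no document
    # encoding needed) and keep the top-m candidates.
    def stage1_score(i):
        return sum(w for tok, w in q.items() if tok in bot_index[i])
    candidates = sorted(range(len(docs)), key=lambda i: (-stage1_score(i), i))[:m]
    # Stage 2: encode only the m candidates on the fly and rerank.
    reranked = sorted(candidates, key=lambda i: (-dot(q, toy_encoder(docs[i])), i))
    return reranked[:k]

docs = [
    "paris paris paris travel guide",
    "paris is the capital of france",
    "berlin is the capital of germany",
    "weather report for tokyo",
]
# Index built by tokenization only; no model is involved.
bot_index = [set(tokenize(d)) for d in docs]

top = late_parametric_search("capital of france paris", docs, bot_index)
```

Under this reading, index preparation costs only tokenization, and the encoder runs on `m` documents per query instead of the whole corpus.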

Paper Structure (54 sections, 15 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Comparison of storage (2 GB vs. 31 GB) and resource costs (1 CPU hour vs. 30 GPU hours) between two indexes.
  • Figure 2: Left: Training frameworks of entangled retriever, disentangled retriever and our proposed semi-parametric disentangled retriever SiDR; Right: Different inference pipelines of SiDR.