Semi-Parametric Retrieval via Binary Bag-of-Tokens Index
Jiawei Zhou, Li Dong, Furu Wei, Lei Chen
TL;DR
SiDR introduces a semi-parametric, disentangled bi-encoder retrieval framework that decouples the index from neural parameters, enabling both embedding-based and tokenization-based indexing. By aligning a parametric representation $V_{\theta}(x)$ with a non-parametric bag-of-tokens index $V_{\text{BoT}}(x)$ through semi-parametric contrastive learning and MLM-inspired alignment, SiDR achieves competitive or superior effectiveness while dramatically reducing indexing costs, exemplified by a drop from 31GB/30 GPU hours to 2GB/1 CPU hour in a Wikipedia-scale corpus. The work proposes flexible search pipelines (SiDR_full, SiDR_beta, and late-parametric reranking) and supports in-training retrieval without index rebuilding, addressing real-time, cost-sensitive, and co-training use cases. Across Wiki21m and BEIR benchmarks, SiDR demonstrates notable gains over BM25 and other baselines, particularly when leveraging the non-parametric index for cost-efficient retrieval, and shows favorable latency characteristics. The results highlight the practical value of semi-parametric, disentangled retrieval for scalable, parameter-agnostic, and data-efficient IR in RAG and exploration scenarios.
Abstract
Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples the retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) when using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) when using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval, while consistently outperforming BM25 on all in-domain datasets; (iii) additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
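To make the idea of a tokenization-based index concrete, the following is a minimal illustrative sketch, not the authors' implementation. It builds a binary bag-of-tokens index using only tokenization (no neural encoder on the document side) and scores documents by the dot product between a weighted query vector and each binary document vector. The whitespace tokenizer, the function names, and the hand-written query weights are all assumptions for illustration; in a semi-parametric setup the query weights would presumably be produced by a trained query encoder over the vocabulary.

```python
from collections import defaultdict


def tokenize(text):
    # Toy whitespace tokenizer (assumption); a real system would use a subword tokenizer.
    return text.lower().split()


def build_bot_index(docs):
    """Binary bag-of-tokens index: maps each token to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in set(tokenize(text)):  # binary: token presence only, no counts
            index[tok].add(doc_id)
    return index


def search(index, query_weights, k=2):
    """Dot product of a weighted query vector with each binary document vector."""
    scores = defaultdict(float)
    for tok, weight in query_weights.items():
        for doc_id in index.get(tok, ()):
            scores[doc_id] += weight
    # Sort by descending score, breaking ties by doc id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock markets fell sharply",
]
index = build_bot_index(docs)

# Hypothetical query weights over the vocabulary (hand-written here for illustration).
# "kitten" has weight but matches no document token, so it contributes nothing.
query_weights = {"cat": 1.5, "kitten": 0.75, "mat": 0.5}
results = search(index, query_weights)
print(results)
```

Because the document side needs only tokenization, adding or changing documents in this sketch never requires re-encoding with a model, which is the property that gives this kind of index its BM25-like indexing cost.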
