An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-Doc
Aldo Porco, Dhruv Mehra, Igor Malioutov, Karthik Radhakrishnan, Moniba Keymanesh, Daniel Preoţiuc-Pietro, Sean MacAvaney, Pengxiang Cheng
TL;DR
The paper tackles the latency bottleneck in Learned Sparse Retrieval (LSR) models like SPLADE-Doc, which is caused by high-frequency terms that produce long posting lists. It introduces DF-FLOPS, a cross-term regularization that augments FLOPS by weighting each term according to its document frequency: $\ell_{\text{DF-FLOPS}} = \sum_{t \in V} \left( \frac{w_t}{N} \sum_{i=1}^{N} r_{i,t} \right)^2$ with $w_t = \mathrm{activ}\left(\frac{DF_t}{|C|}\right)$, where $DF_t$ is a corpus-wide document-frequency estimate that is updated periodically during training. Applied to SPLADE-Doc, DF-FLOPS reduces the prevalence of high-DF tokens and lowers retrieval latency by roughly 10x relative to FLOPS baselines, while maintaining in-domain effectiveness (a 2.2-point drop in MRR@10) and improving cross-domain performance on 12 of the 13 BEIR tasks tested. Because the penalty is applied during training, the model can still include high-frequency terms selectively when they are salient, which yields BM25-like latency with competitive effectiveness and makes LSR more practical to deploy in production search engines. Future work may explore alternative DF estimation strategies and additional SPLADE variants.
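The DF-FLOPS formula above can be illustrated with a minimal pure-Python sketch. It assumes that $N$ is the number of representations in a batch, that $r_{i,t}$ is the weight of term $t$ in the $i$-th representation, and that $|C|$ is the corpus size. The summary does not specify the `activ` function, so the identity function used here is only a placeholder, and the names `flops_loss`, `df_weights`, and `df_flops_loss` are hypothetical.

```python
def flops_loss(reps):
    """Plain FLOPS: sum over terms of (mean batch activation) squared."""
    n = len(reps)
    vocab_size = len(reps[0])
    return sum((sum(r[t] for r in reps) / n) ** 2 for t in range(vocab_size))


def df_weights(df_counts, corpus_size, activ=lambda x: x):
    """w_t = activ(DF_t / |C|). The identity default is an assumption."""
    return [activ(df / corpus_size) for df in df_counts]


def df_flops_loss(reps, weights):
    """DF-FLOPS: each term's mean activation is scaled by w_t before squaring."""
    n = len(reps)
    return sum(
        (w / n * sum(r[t] for r in reps)) ** 2 for t, w in enumerate(weights)
    )


# Toy batch: 2 representations over a 3-term vocabulary (illustrative values).
reps = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
# Hypothetical document-frequency estimates for a corpus of 100 documents.
weights = df_weights([50, 10, 100], corpus_size=100)

plain = flops_loss(reps)
weighted = df_flops_loss(reps, weights)
```

In this toy batch, term 0 has the larger mean activation, yet under the DF weighting it contributes no more than term 2, whose estimated document frequency is twice as high. In a real training loop the `DF_t` estimates would be refreshed periodically, as the summary describes.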
Abstract
Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization only ensures sparsity within a given query or document, not across terms. Terms with very high Document Frequencies (DFs) substantially increase latency in production retrieval engines, such as Apache Solr, due to their lengthy posting lists. To address the issue of high DFs, we present a new variant of FLOPS regularization: DF-FLOPS. This new regularization technique penalizes the usage of high-DF terms, thereby shortening posting lists and reducing retrieval latency. Unlike inference-time sparsification methods, such as stopword removal, DF-FLOPS regularization allows for the selective inclusion of high-frequency terms in cases where the terms are truly salient. We find that DF-FLOPS successfully reduces the prevalence of high-DF terms and lowers retrieval latency (around 10x faster) in a production-grade engine while maintaining effectiveness both in-domain (only a 2.2-point drop in MRR@10) and cross-domain (improved performance in 12 out of 13 tasks on which we tested). With retrieval latencies on par with BM25, this work provides an important step towards making LSR practical for deployment in production-grade search engines.
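To illustrate why high-DF terms hurt latency, the sketch below builds a toy inverted index from sparse document vectors and counts how many postings a query touches. This is only a rough proxy under stated assumptions: the posting count stands in for retrieval cost, the documents and weights are invented for illustration, and the helper names `build_inverted_index` and `postings_touched` are hypothetical. A production engine such as Apache Solr applies optimizations that are not modeled here.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to its posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, weight in vec.items():
            if weight > 0:  # only non-zero weights are indexed
                index[term].append((doc_id, weight))
    return index


def postings_touched(index, query_terms):
    """Rough cost proxy: total length of the posting lists a query reads."""
    return sum(len(index.get(term, [])) for term in query_terms)


# Toy sparse document vectors; "the" appears in every document (high DF).
docs = [
    {"the": 0.4, "solr": 1.2},
    {"the": 0.3, "index": 0.9},
    {"the": 0.5, "latency": 1.1},
]
index = build_inverted_index(docs)

# A term's document frequency equals the length of its posting list.
df_the = len(index["the"])
cost_with_the = postings_touched(index, ["the", "solr"])
cost_without_the = postings_touched(index, ["solr"])
```

Here the query that includes "the" reads four postings while the one without it reads a single posting. Stopword removal would drop "the" everywhere, whereas a training-time penalty lets the model keep such a term in the documents where it is salient.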
