SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval

Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, Qun Liu

TL;DR

SparTerm presents a direct approach to learning sparse, term-based representations over the full vocabulary space by coupling an importance predictor with a gating controller, enabling both term weighting and expansion within a single framework. By leveraging PLM-derived context, it achieves strong retrieval performance on MSMARCO, notably surpassing existing sparse methods and approaching or beating some dense baselines in top-ranked results. The work provides substantial evidence that transferring deep PLM knowledge into sparse BoW-like representations is viable and beneficial for fast, interpretable first-stage retrieval, with detailed analyses of the weighting and expansion mechanisms. This has practical implications for scalable IR systems that require efficient lexical matching with semantic sensitivity.

Abstract

Term-based sparse representations dominate first-stage text retrieval in industrial applications, due to their advantages in efficiency, interpretability, and exact term matching. In this paper, we study the problem of transferring the deep knowledge of a pre-trained language model (PLM) to term-based sparse representations, aiming to improve the representation capacity of bag-of-words (BoW) methods for semantic-level matching while keeping their advantages. Specifically, we propose a novel framework, SparTerm, to directly learn sparse text representations in the full vocabulary space. SparTerm comprises an importance predictor, which predicts the importance of each term in the vocabulary, and a gating controller, which controls term activation. These two modules cooperatively ensure the sparsity and flexibility of the final text representation, which unifies term weighting and expansion in the same framework. Evaluated on the MSMARCO dataset, SparTerm significantly outperforms traditional sparse methods and achieves state-of-the-art ranking performance among all PLM-based sparse models.

Paper Structure

This paper contains 21 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison between the BoW and SparTerm representations. Color depth encodes term weight: the darker the color, the higher the weight. Compared with BoW, SparTerm identifies the semantically important terms and expands to terms that do not appear in the passage but are semantically relevant, including terms from the target query such as "sign".
  • Figure 2: Model Architecture of SparTerm. Our overall architecture contains an importance predictor and a gating controller. The importance predictor generates a dense importance distribution with the dimension of vocabulary size, while the gating controller outputs a sparse and binary gating vector to control term activation for the final representation. These two modules cooperatively ensure the sparsity and flexibility of the final representation.
  • Figure 3: Term weights assigned to different passages by DeepCT and SparTerm, and the expanded terms with their probabilities (before binarization) predicted by SparTerm. Color depth encodes term weight: the darker the color, the higher the weight.
  • Figure 4: The top 5 contributing words for each expanded word in the second case of Figure 3. The X-axis shows the words in the passage and the Y-axis shows the logit.
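
The Figure 2 caption describes a dense importance vector over the vocabulary combined with a sparse binary gating vector. The sketch below is one plausible reading of that combination, not the authors' implementation: it assumes the final representation is the element-wise product of the two vectors, that gate probabilities are binarized with a threshold of 0.5, and that retrieval scores are dot products. The toy vocabulary, the numeric values, and the helper names (`binarize`, `sparse_representation`, `score`) are all hypothetical.

```python
# Toy vocabulary (hypothetical); a real model would use the PLM's full vocabulary.
VOCAB = ["sign", "contract", "signature", "weather", "the"]


def binarize(gate_probs, threshold=0.5):
    """Turn per-term gate probabilities into a binary gating vector.

    The 0.5 threshold is an assumption made for this sketch.
    """
    return [1 if p >= threshold else 0 for p in gate_probs]


def sparse_representation(importance, gate):
    """Element-wise product of dense importance weights and a binary gate.

    Terms whose gate is 0 are switched off, so the result is sparse.
    """
    if len(importance) != len(gate):
        raise ValueError("importance and gate must have the same length")
    return [w * g for w, g in zip(importance, gate)]


def score(query_vec, passage_vec):
    """Dot product of two vocabulary-sized vectors (assumed scoring function)."""
    return sum(q * p for q, p in zip(query_vec, passage_vec))


# Made-up outputs of the two modules for one passage.
importance = [0.9, 1.4, 1.1, 0.2, 0.3]    # dense: every term gets a weight
gate_probs = [0.8, 0.95, 0.7, 0.1, 0.2]   # probabilities before binarization

gate = binarize(gate_probs)
passage_vec = sparse_representation(importance, gate)

# Keep only the activated terms, as an inverted index would store them.
active_terms = {t: w for t, w in zip(VOCAB, passage_vec) if w > 0}

# A one-term query for "sign" matches the passage through the activated term.
query_vec = [1, 0, 0, 0, 0]
match = score(query_vec, passage_vec)
```

In this toy case only "sign", "contract", and "signature" stay active, so the passage can match a query containing "sign" whether that term was in the passage text or was added by expansion.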