Leveraging Decoder Architectures for Learned Sparse Retrieval

Jingfen Qiao, Thong Nguyen, Evangelos Kanoulas, Andrew Yates

TL;DR

The paper studies learned sparse retrieval (LSR) with encoder-only, decoder-only, and encoder-decoder transformers, each used to generate lexical sparse representations for first-stage retrieval. It compares sparse representation heads (MLP and MLM, with single-token and multi-token variants) and introduces a multi-tokens decoding scheme that aggregates term expansions across all input tokens. Zero-shot LLMs struggle in two ways: they either produce noisy, inappropriate expansion terms or fail to expand at all. Decoder-only LSR needs a very large parameter count to approach encoder-only performance. Encoder-decoder backbones with multi-tokens decoding are the most effective, outperforming encoder-only and decoder-only backbones under comparable training and distillation settings. The effects of scaling and of training signals such as MarginMSE distillation and FLOPs regularization depend on the architecture. Overall, the work highlights the importance of backbone choice and targeted training strategies for practical, scalable sparse retrieval with inverted indexes.
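The difference between the single-token and multi-token MLM heads can be illustrated with a minimal sketch. It assumes a SPLADE-style formulation (log-saturated ReLU over vocabulary logits, with max pooling across positions for the multi-token case); the summary does not spell out the paper's exact equations, and the function names and toy logits below are hypothetical.

```python
import numpy as np

def mlm_single_token(logits: np.ndarray, position: int = -1) -> np.ndarray:
    """Term weights from the vocabulary logits at one position only.

    Assumption: the chosen position defaults to the last token.
    """
    return np.log1p(np.maximum(logits[position], 0.0))

def mlm_multi_tokens(logits: np.ndarray) -> np.ndarray:
    """Term weights aggregated over every position by max pooling.

    Assumption: SPLADE-style log(1 + ReLU(.)) followed by a max over positions.
    """
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

# Toy input: 3 token positions, vocabulary of 5 terms (made-up values).
logits = np.array([
    [2.0, -1.0, 0.0, -3.0, 0.5],
    [-1.0, 1.5, 0.0, -2.0, -0.5],
    [-2.0, -1.0, 0.0, -1.0, 3.0],
])

single = mlm_single_token(logits)  # uses only the last row
multi = mlm_multi_tokens(logits)   # uses all three rows

print("single-token active terms:", np.flatnonzero(single))
print("multi-token active terms: ", np.flatnonzero(multi))
```

In this toy case the multi-token head activates terms that appear only in the logits of earlier positions, which is the intuition behind gathering information from all input tokens rather than one.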

Abstract

Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. With the advent of large-scale pre-trained language models, their capability to generate sparse representations for retrieval tasks across different transformer-based architectures, including encoder-only, decoder-only, and encoder-decoder models, remains largely unexplored. This study investigates the effectiveness of LSR across these architectures, exploring various sparse representation heads and model scales. Our results highlight the limitations of using large language models to create effective sparse representations in zero-shot settings, identifying challenges such as inappropriate term expansions and reduced performance due to the lack of expansion. We find that the encoder-decoder architecture with multi-tokens decoding approach achieves the best performance among the three backbones. While the decoder-only model performs worse than the encoder-only model, it demonstrates the potential to outperform when scaled to a high number of parameters.

Paper Structure

This paper contains 11 sections, 10 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Output Bags of Tokens Produced by Different Sparse Representation Heads; Zero-shot encoder (FlanT5-xl) misses many important expansion terms. The MLM-MultiTokens head captures more relevant tokens than the MLM-SingleToken head by gathering contextual information from all input tokens rather than a single token.
  • Figure 2: Learned sparse retrieval architectures consist of (1) a transformer backbone that takes query or document text as input and outputs hidden state(s) and (2) a sparse representation head that takes the hidden state(s) as input and outputs sparse lexical representations.
  • Figure 3: Score distributions of the two teacher models on the MS MARCO training set. (a) RankLLaMA-13B exhibits a sharper distribution than MiniLM-L-6-v2. (b) We apply an affine transformation to align the mean and standard deviation of the RankLLaMA-13B score distribution with those of MiniLM-L-6-v2.
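
The affine alignment described for Figure 3 can be sketched as follows. This is a minimal illustration, assuming the transformation is the standard mean and standard deviation matching; the function name `align_scores` and the synthetic score arrays are hypothetical stand-ins, not the paper's data or code.

```python
import numpy as np

def align_scores(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Affinely rescale `source` so its mean and std match `reference`.

    Assumption: alignment is a * x + b with a = std_ref / std_src and
    b chosen so that the means coincide.
    """
    scale = reference.std() / source.std()
    return (source - source.mean()) * scale + reference.mean()

# Synthetic stand-ins for two teachers' scores (not real model outputs).
rng = np.random.default_rng(0)
source_scores = rng.normal(loc=-1.0, scale=4.0, size=1000)
reference_scores = rng.normal(loc=2.0, scale=6.0, size=1000)

aligned = align_scores(source_scores, reference_scores)
print(f"aligned mean={aligned.mean():.3f}, std={aligned.std():.3f}")
print(f"reference mean={reference_scores.mean():.3f}, std={reference_scores.std():.3f}")
```

Because the scale factor is positive, the transformation changes the magnitude of score margins (which matters for a margin-based distillation loss) without changing how the source teacher ranks passages.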