Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval

Guangyuan Ma, Xing Wu, Zijia Lin, Songlin Hu

TL;DR

The paper addresses two problems with masked auto-encoder (MAE) pre-training for dense passage retrieval: it is unclear why it works, and its extra decoder is computationally expensive. The authors show that MAE with enhanced decoding increases the coverage of input terms in dense representations, which motivates a decoder-free Bag-of-Word (BoW) pre-training task that directly predicts the bag of input tokens. BoW achieves state-of-the-art retrieval on MS-MARCO, NQ, TriviaQA, and BEIR without any extra decoder, with a 67% training speed-up over MAE with enhanced decoding. The approach is interpretable and easy to implement, and it highlights the value of compressing lexical information directly into dense representations for robust retrieval across domains.

Abstract

Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems. It generally utilizes additional Transformer decoder blocks to provide sustainable supervision signals and compress contextual information into dense representations. However, the underlying reasons for the effectiveness of such a pre-training technique remain unclear. The use of additional Transformer-based decoders also incurs significant computational costs. In this study, we aim to shed light on this issue by revealing that masked auto-encoder (MAE) pre-training with enhanced decoding significantly improves the term coverage of input tokens in dense representations, compared to vanilla BERT checkpoints. Building upon this observation, we propose a modification to the traditional MAE: replacing the decoder of a masked auto-encoder with a much simpler Bag-of-Word prediction task. This modification enables the efficient compression of lexical signals into dense representations through unsupervised pre-training. Remarkably, our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters, and it provides a 67% training speed-up compared to standard masked auto-encoder pre-training with enhanced decoding.
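To make the Bag-of-Word prediction task concrete, here is a minimal sketch of one plausible form of the objective. It assumes the dense representation has already been projected to vocabulary-sized logits, that the "bag" is the set of distinct input token ids, and that the loss is the average negative log-likelihood of those tokens under a softmax over the vocabulary. The function name `bow_loss` and these choices are illustrative assumptions, not the paper's confirmed formulation.

```python
import math


def bow_loss(vocab_logits, input_token_ids):
    """Average negative log-likelihood of the distinct input tokens.

    vocab_logits: one score per vocabulary entry (assumed to come from
        projecting the dense representation to the vocabulary space).
    input_token_ids: token ids of the input text; duplicates are ignored
        because the bag is treated as a set here (an assumption).
    """
    bag = set(input_token_ids)
    if not bag:
        raise ValueError("input_token_ids must not be empty")

    # Numerically stable log of the softmax normaliser.
    max_logit = max(vocab_logits)
    log_z = max_logit + math.log(
        sum(math.exp(v - max_logit) for v in vocab_logits)
    )

    # log p(t) = logit_t - log_z; average the negative over the bag.
    return -sum(vocab_logits[t] - log_z for t in bag) / len(bag)
```

Under this sketch the loss falls as the logits of input tokens rise relative to the rest of the vocabulary, which is the sense in which lexical information is compressed into the representation. No decoder parameters are involved; only the projection to the vocabulary is needed.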

Paper Structure (27 sections, 13 equations, 5 figures, 6 tables, 2 algorithms)

Figures (5)

  • Figure 1: Comparison of Masked Auto-Encoder Pre-training and Bag-of-Word Prediction Pre-training.
  • Figure 2: Compositions of Top-k tokens of dense representation.
  • Figure 3: Examples of the compositions of Top-20 tokens of dense representation. The input texts are encoded with various encoders and projected to the vocabulary space to interpret the dominating tokens. ✓ marks a token that appears in the input text, while ✗ marks a token that does not.
  • Figure 4: Input token coverage of the Top-k tokens of dense representation after Bag-of-Word prediction pre-training.
  • Figure 5: Examples of the compositions of Top-20 tokens of dense representation after pre-training with Bag-of-Word prediction.
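
The Top-k analysis described in the figure captions can be sketched as follows. This is an illustrative reading, assuming the dense representation has already been projected to vocabulary logits and that coverage is the fraction of the Top-k vocabulary entries that occur in the input text. The helper names `topk_hits` and `topk_coverage` are hypothetical, and the paper's exact metric may be defined differently.

```python
def topk_hits(vocab_logits, input_token_ids, k):
    """Return (token_id, hit) pairs for the k highest-scoring vocabulary ids.

    hit is True when the token id occurs in the input text (the ✓ case)
    and False otherwise (the ✗ case).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    input_ids = set(input_token_ids)
    # Rank vocabulary ids by logit, highest first, and keep the first k.
    ranked = sorted(
        range(len(vocab_logits)),
        key=lambda i: vocab_logits[i],
        reverse=True,
    )
    return [(i, i in input_ids) for i in ranked[:k]]


def topk_coverage(vocab_logits, input_token_ids, k):
    """Fraction of the Top-k tokens that appear in the input text."""
    hits = topk_hits(vocab_logits, input_token_ids, k)
    return sum(1 for _, hit in hits if hit) / len(hits)
```

For example, with logits `[0.1, 2.0, 1.5, -1.0, 3.0]`, input token ids `[4, 2, 2]`, and `k=3`, the Top-3 ids are 4, 1, and 2, of which 4 and 2 are hits, so the coverage is 2/3.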