SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

TL;DR

Dense passage retrieval requires robust, transferable representations learned from unlabeled data. SimLM pre-trains an encoder–decoder with a bottleneck at the [CLS] token using a replaced language modeling objective inspired by ELECTRA, forcing the encoder to compress semantic information into a single vector $h_{cls}$ and enabling effective initialization for biencoder-based retrievers. The approach yields state-of-the-art results on MS-MARCO and competitive performance on Natural Questions, while offering storage and inference advantages over multi-vector methods like ColBERTv2. This simple, label-free pre-training plus a scalable distillation-based fine-tuning pipeline has practical impact for building fast, accurate dense retrievers in large corpora and can extend to broader retrieval settings and multilingual scenarios.
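To make the storage and inference contrast with multi-vector methods concrete, here is a minimal sketch in plain Python. It compares single-vector scoring (one dot product per passage) with late-interaction scoring of the kind ColBERT-style models use (a max over passage tokens, summed over query tokens). The function names, the toy vectors, and the corpus sizes are assumptions for illustration only; they do not come from the paper, and the storage count ignores any compression or dimensionality reduction a real multi-vector system may apply.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec, passage_vec):
    """Biencoder-style relevance: one dot product per query-passage pair."""
    return dot(query_vec, passage_vec)


def late_interaction_score(query_token_vecs, passage_token_vecs):
    """Multi-vector relevance: each query token picks its best-matching
    passage token, and the per-token maxima are summed."""
    return sum(
        max(dot(q, p) for p in passage_token_vecs)
        for q in query_token_vecs
    )


def floats_stored(num_passages, dim, vectors_per_passage=1):
    """Number of floats an uncompressed index must hold."""
    return num_passages * vectors_per_passage * dim


# Toy 2-dimensional vectors (hypothetical values).
single = single_vector_score([1.0, 0.0], [0.5, 0.5])
multi = late_interaction_score(
    [[1.0, 0.0], [0.0, 1.0]],  # two query token vectors
    [[1.0, 0.0], [0.0, 2.0]],  # two passage token vectors
)

# Hypothetical corpus: 1,000 passages, 768-dim vectors, 70 tokens per passage.
single_index = floats_stored(1000, 768)
multi_index = floats_stored(1000, 768, vectors_per_passage=70)
print(single, multi, multi_index // single_index)
```

Under these assumed sizes, the uncompressed multi-vector index grows linearly with the number of tokens stored per passage, which is the cost a single-vector retriever avoids.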

Abstract

In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly more storage cost. Our code and model checkpoints are available on GitHub (https://github.com/microsoft/unilm/tree/master/simlm).
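The input corruption behind a replaced language modeling objective can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: `corrupt_with_generator`, `toy_generator`, the toy vocabulary, and the 30% sampling rate are all hypothetical. It shows only the key property that selected positions receive tokens sampled from a generator distribution rather than a `[MASK]` symbol, so the corrupted input still looks like ordinary text, while the training targets remain the original tokens.

```python
import random


def corrupt_with_generator(tokens, generator, sample_prob, rng):
    """Resample a random subset of positions from a generator distribution.

    Returns the corrupted sequence, the original tokens as targets, and a
    flag per position saying whether that position was resampled. A
    resampled token may happen to equal the original token.
    """
    corrupted = []
    sampled_positions = []
    for position, token in enumerate(tokens):
        if rng.random() < sample_prob:
            candidates, weights = generator(tokens, position)
            corrupted.append(rng.choices(candidates, weights=weights, k=1)[0])
            sampled_positions.append(True)
        else:
            corrupted.append(token)
            sampled_positions.append(False)
    return corrupted, list(tokens), sampled_positions


def toy_generator(tokens, position):
    """Hypothetical stand-in for a small masked language model.

    It returns a fixed vocabulary with extra weight on the original token,
    mimicking a generator that is often, but not always, right.
    """
    vocab = ["the", "a", "cat", "dog", "sat", "ran", "mat", "rug"]
    weights = [3.0 if word == tokens[position] else 1.0 for word in vocab]
    return vocab, weights


rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets, sampled_positions = corrupt_with_generator(
    tokens, toy_generator, sample_prob=0.3, rng=rng
)
print(corrupted)
print(targets)
```

In a full model, the encoder would read `corrupted`, and a decoder that sees the encoder only through its single bottleneck vector would be trained to recover `targets`; that part is omitted here.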

Paper Structure (18 sections, 5 equations, 3 figures, 14 tables)

Figures (3)

  • Figure 1: Pre-training architecture of SimLM. Replaced tokens (underlined) are randomly sampled from the generator distribution.
  • Figure 2: Illustration of our supervised fine-tuning pipeline. Note that we only use SimLM to initialize the biencoder-based retrievers. For the cross-encoder-based re-ranker, we use off-the-shelf pre-trained models such as ELECTRA$_\text{base}$.
  • Figure 3: Our pre-training objective converges faster and consistently outperforms vanilla masked language model pre-training. The y-axis shows MRR@10 on the MS-MARCO dev set.
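
Figure 3 reports MRR@10, the mean over queries of the reciprocal rank of the first relevant passage within the top 10 results (zero if none appears). A minimal sketch of that computation follows; the function name `mrr_at_k` and the toy rankings are assumptions for illustration, not the evaluation script used in the paper.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant passage in the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)


# Three toy queries (hypothetical passage ids).
rankings = [
    ["p1", "p2", "p3"],                # relevant passage at rank 1
    ["p9", "p8", "p7", "p4"],          # relevant passage at rank 4
    ["p%d" % i for i in range(20, 35)],  # no relevant passage in the top 10
]
relevant = [{"p1"}, {"p4"}, {"p34"}]

score = mrr_at_k(rankings, relevant, k=10)
print(round(score, 4))
```

The three queries contribute 1, 1/4, and 0 respectively, so the score is their mean, 1.25 / 3.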