RepBERT: Contextualized Text Embeddings for First-Stage Retrieval

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma

TL;DR

This paper introduces RepBERT, a BERT-based model that represents queries and documents as fixed-length contextualized embeddings and ranks documents for first-stage retrieval by the inner product of those embeddings. Because documents are encoded offline and only the query is encoded at search time, RepBERT achieves state-of-the-art first-stage performance on MS MARCO Passage Ranking with efficiency comparable to bag-of-words methods. The work analyzes training dynamics, recall at different depths, and the interaction with a downstream reranker, highlighting both the benefits and the mismatches that arise when semantic and exact-match signals are integrated. It also demonstrates that combining RepBERT with traditional exact-match retrievers yields further gains. Overall, RepBERT shows the feasibility and value of representation-focused neural methods for scalable, high-quality initial retrieval.

Abstract

Although exact term match between queries and documents is the dominant method for first-stage retrieval, we propose a different approach, called RepBERT, which represents documents and queries with fixed-length contextualized embeddings. The inner products of query and document embeddings are regarded as relevance scores. On the MS MARCO Passage Ranking task, RepBERT achieves state-of-the-art results among all initial retrieval techniques, and its efficiency is comparable to that of bag-of-words methods.
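
The scoring rule described in the abstract can be illustrated with a minimal sketch. It assumes the fixed-length embeddings have already been computed; the toy 3-dimensional vectors, the document ids, and the helper names `inner_product` and `rank_documents` are hypothetical and are not taken from the paper. The sketch scans every document exhaustively and says nothing about how the authors encode text or index embeddings.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def inner_product(u: Vector, v: Vector) -> float:
    """Relevance score: inner product of two equal-length embeddings."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rank_documents(
    query_emb: Vector, doc_embs: Dict[str, Vector], depth: int
) -> List[Tuple[str, float]]:
    """Score every precomputed document embedding against the query
    and return the `depth` highest-scoring (doc_id, score) pairs."""
    scored = [
        (doc_id, inner_product(query_emb, emb))
        for doc_id, emb in doc_embs.items()
    ]
    # Highest inner product first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:depth]


# Assumption: toy embeddings standing in for encoder output.
# In the offline/online split, DOC_EMBS would be computed once ahead of time
# and only QUERY_EMB would be computed when a query arrives.
DOC_EMBS = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.2, 0.8, 0.1],
    "d3": [0.0, 0.3, 0.9],
}
QUERY_EMB = [1.0, 0.2, 0.0]

top = rank_documents(QUERY_EMB, DOC_EMBS, depth=2)
print([doc_id for doc_id, _ in top])
```

Because the document side does not depend on the query, all document embeddings can be stored ahead of time, which is what makes this kind of scoring practical as a first stage.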

Paper Structure

This paper contains 19 sections, 6 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: At different depths, the recall of the first-stage retrieval method and the reranking accuracy of BERT Large. Dataset: MS MARCO dev.
  • Figure 2: For a certain depth, the average proportion of retrieved documents that are also in the official top-1000 candidates provided by MS MARCO. Dataset: MS MARCO dev.
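
The two quantities named in the figure captions can be sketched for a single query as follows. This assumes the usual definition of recall at a depth (the fraction of relevant documents found in the top results) and reads Figure 2's proportion as the share of top results that also appear in a given candidate set. The function names and toy ids are hypothetical, and the paper's exact evaluation procedure is not reproduced here; the figures would additionally average these values over all queries.

```python
from typing import List, Set


def recall_at_depth(ranked: List[str], relevant: Set[str], depth: int) -> float:
    """Fraction of the relevant documents that appear in the top `depth` results."""
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    hits = relevant.intersection(ranked[:depth])
    return len(hits) / len(relevant)


def overlap_at_depth(ranked: List[str], candidates: Set[str], depth: int) -> float:
    """Proportion of the top `depth` results that are also in `candidates`."""
    top = ranked[:depth]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in candidates) / len(top)


# Assumption: toy ranking and judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
candidates = {"d1", "d3", "d9"}

print(recall_at_depth(ranked, relevant, depth=2))
print(overlap_at_depth(ranked, candidates, depth=4))
```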