Revela: Dense Retriever Learning via Language Modeling

Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl

TL;DR

Revela is a unified and scalable training framework for self-supervised retriever learning via language modeling. It achieves BEIR's unsupervised SoTA with ~ 1000x less training data and 10x less compute.

Abstract

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~ 1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
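The core idea in the abstract, that retriever similarity scores act as weights over the other documents in a batch, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the use of cosine similarity, the `temperature` parameter, the pooled per-document vectors in `doc_states`, and both function names are assumptions made for the example. In Revela the weighting is applied inside transformer blocks and trained through the next token prediction loss, which this sketch leaves out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_weights(embeddings, temperature=1.0):
    """Row i is a softmax over the similarity of document i to every OTHER
    document in the batch (self-similarity is excluded).
    Assumes at least two documents; `temperature` is a hypothetical knob."""
    n = len(embeddings)
    weights = []
    for i in range(n):
        scores = [
            cosine(embeddings[i], embeddings[j]) / temperature if j != i else None
            for j in range(n)
        ]
        top = max(s for s in scores if s is not None)  # for numerical stability
        exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def cross_document_context(weights, doc_states):
    """For each document, a weighted sum of the other documents' states."""
    dim = len(doc_states[0])
    return [
        [sum(w * state[d] for w, state in zip(row, doc_states)) for d in range(dim)]
        for row in weights
    ]


# Toy batch: documents 0 and 1 are similar, document 2 is unrelated.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
doc_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # stand-ins for LM hidden states

weights = in_batch_weights(embeddings)
context = cross_document_context(weights, doc_states)
```

In this toy batch, document 0 places more weight on document 1 than on document 2, so its cross-document context is pulled toward the state of the more similar document. If the weights were produced by a trainable retriever and the mixed context fed a language modeling loss, the gradient of that loss would reach the retriever through the weights, which is the mechanism the abstract describes.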

Paper Structure

This paper contains 49 sections, 14 equations, 7 figures, 18 tables.

Figures (7)

  • Figure 1: The framework of Revela. The retriever's in-batch similarity scores are used as in-batch attention weights inside transformer blocks. The retriever is trained by optimizing the language modeling objective, i.e., next token prediction (NTP). The related patterns in the red and purple sequences are highlighted in bold and underlined. An example of training dynamics is illustrated in the appendix.
  • Figure 2: Revela's architecture. With an attention map, the in-batch attention embeddings $\{\mathrm{h}_i^l\}_{i=1}^{B}$ can attend to the self-attention embeddings $\{\mathrm{e}_{i}^l\}_{i=1}^{B}$.
  • Figure 3: Performance on BRIGHT (left) and BEIR (right) (NDCG@10, %). Results for Revela are shown in opaque bars, while all other models are shown in transparent bars. On BRIGHT, a reasoning-intensive benchmark, Revela performs on par with large supervised models and proprietary APIs. On BEIR, Revela achieves performance similar to E5-PT with much less data and compute. Please refer to the per-task BEIR and BRIGHT results tables in the supplementary appendix.
  • Figure 4: Performance comparison on CoIR and BEIR with different batch sizes. On both benchmarks, Revela's performance generally scales with batch size.
  • Figure 5: Performance comparison on CoIR and BEIR using various combinations of retrievers and LMs. For code retrieval tasks, larger LMs can yield greater gains in retriever performance.
  • ...and 2 more figures