Revela: Dense Retriever Learning via Language Modeling

Fengyu Cai; Tong Chen; Xinran Zhao; Sihao Chen; Hongming Zhang; Sherry Tongshuang Wu; Iryna Gurevych; Heinz Koeppl

Revela: Dense Retriever Learning via Language Modeling

Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl

TL;DR

Revela is introduced, a unified and scalable training framework for self-supervised retriever learning via language modeling that achieves BEIR's unsupervised SoTA with ~ 1000x less training data and 10x less compute.

Abstract

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~ 1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.

Revela: Dense Retriever Learning via Language Modeling

TL;DR

Abstract

Revela: Dense Retriever Learning via Language Modeling

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (7)