Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Jon Saad-Falcon; Daniel Y. Fu; Simran Arora; Neel Guha; Christopher Ré

TL;DR

This paper tackles the challenge of long-context retrieval by introducing LoCoV1, a benchmark that exposes the need for reasoning over long documents, and presenting M2-BERT, an 80M-parameter state-space retrieval encoder based on the Monarch Mixer that handles up to 32K tokens. The authors develop a mixed short/long pretraining regime and a batch-size–independent fine-tuning approach using orthogonal projection loss, enabling effective retrieval under GPU memory constraints. Empirical results show M2-BERT-32k achieving state-of-the-art or competitive performance on LoCoV1 and BEIR while vastly outperforming similarly-sized Transformer baselines in speed and efficiency. The work also demonstrates the broader applicability of the learned embeddings to clustering and MTEB tasks, highlighting the practical impact for long-context information retrieval in real-world domains.

Abstract

Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to fine-tune this model for retrieval under the batch size limitations imposed by GPU memory constraints. To address these challenges, we first introduce LoCoV1, a novel 12-task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective. We next present the M2-BERT retrieval encoder, an 80M-parameter state-space encoder model built from the Monarch Mixer architecture, capable of scaling to documents up to 32K tokens long. We describe a pretraining data mixture that allows this encoder to process both short- and long-context sequences, and a fine-tuning approach that adapts this base model to retrieval with only single-sample batches. Finally, we validate the M2-BERT retrieval encoder on LoCoV1, finding that it outperforms competitive Transformer-based models by at least 23.3 points, despite containing upwards of 90x fewer parameters.
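
The single-sample fine-tuning described above depends on a loss that can be computed from one query-document pair at a time, rather than from in-batch negatives. A minimal sketch of such a loss follows. It assumes a squared-error form that pushes the cosine similarity of relevant pairs toward 1 and of irrelevant pairs toward 0 (orthogonal embeddings). The function names and this exact formulation are illustrative assumptions, not the authors' implementation, and the paper's orthogonal projection loss may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonal_projection_loss(query_emb, doc_emb, is_relevant):
    """Hypothetical per-pair loss (an assumption, not the paper's exact form).

    Relevant pairs are pulled toward cosine similarity 1 (aligned);
    irrelevant pairs are pushed toward cosine similarity 0 (orthogonal).
    The loss uses only one pair, so it does not depend on batch size.
    """
    target = 1.0 if is_relevant else 0.0
    return (cosine(query_emb, doc_emb) - target) ** 2


# Usage: each pair is scored independently of any other example in the batch.
aligned_positive = orthogonal_projection_loss([1.0, 0.0], [1.0, 0.0], True)
orthogonal_negative = orthogonal_projection_loss([1.0, 0.0], [0.0, 1.0], False)
aligned_negative = orthogonal_projection_loss([1.0, 0.0], [1.0, 0.0], False)
```

Because each term involves a single pair, the loss has the same form at batch size 1 as at any larger batch size. Contrastive objectives that draw their negatives from the rest of the batch do not have this property, which matters when a 32K-token document leaves room for only one sample in GPU memory.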

Paper Structure (28 sections, 3 equations, 6 figures, 23 tables)

Introduction
Related Work
LoCoV1 Retrieval Benchmark
M2-BERT Retrieval Encoder
Architecture
Pretraining
Fine-tuning
Experiments
Comparing M2-BERT to Existing Retriever Models
Ablation of Pretraining and Finetuning
Applications of M2-BERT Retrieval Encoders
Conclusion
Acknowledgements
Appendix
LoCoV1 Overview
...and 13 more sections

Figures (6)

Figure 1: Left: The LoCoV1 long document retrieval benchmark and the average document length of its constituent datasets. Center Left: M2-BERT sequence mixer. Center Right: The orthogonal projection loss. Right: Performance of various retrieval models and M2-BERT at different sequence lengths on LoCoV1. Circles are open models, where circle area corresponds to model size. X marks are closed models.
Figure 2: M2-BERT and Baseline Model Performance on Needle-in-the-Haystack Synthetic Task.
Figure 3: t-SNE Visualization of M2-BERT-32K Embeddings of RedPajama-V1 sample.
Figure 4: Cold vs. Warm Start for M2-BERT-32k Pretraining Checkpoints.
Figure 5: LoCoV1 Document Token Count Distributions.
...and 1 more figure
