Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Jon Saad-Falcon, Daniel Y. Fu, Simran Arora, Neel Guha, Christopher Ré

TL;DR

This paper tackles the challenge of long-context retrieval by introducing LoCoV1, a benchmark that exposes the need for reasoning over long documents, and presenting M2-BERT, an 80M-parameter state-space retrieval encoder based on the Monarch Mixer that handles up to 32K tokens. The authors develop a mixed short/long pretraining regime and a batch-size–independent fine-tuning approach using orthogonal projection loss, enabling effective retrieval under GPU memory constraints. Empirical results show M2-BERT-32k achieving state-of-the-art or competitive performance on LoCoV1 and BEIR while vastly outperforming similarly sized Transformer baselines in speed and efficiency. The work also demonstrates the broader applicability of the learned embeddings to clustering and MTEB tasks, highlighting the practical impact for long-context information retrieval in real-world domains.

Abstract

Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to fine-tune this model for retrieval under the batch size limitations imposed by GPU memory constraints. To address these challenges, we first introduce LoCoV1, a novel 12-task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective. We next present the M2-BERT retrieval encoder, an 80M-parameter state-space encoder model built from the Monarch Mixer architecture, capable of scaling to documents up to 32K tokens long. We describe a pretraining data mixture which allows this encoder to process both short and long context sequences, and a fine-tuning approach that adapts this base model to retrieval with only single-sample batches. Finally, we validate the M2-BERT retrieval encoder on LoCoV1, finding that it outperforms competitive Transformer-based models by at least 23.3 points, despite containing upwards of 90x fewer parameters.
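
The abstract does not spell out the fine-tuning objective, so the following is only a minimal sketch of how a loss can work with single-sample batches. It assumes an orthogonal-projection-style objective that pushes the cosine similarity of a query-document pair toward 1 when the document is relevant and toward 0 (orthogonal embeddings) when it is not. The function names and the squared-error form are illustrative assumptions, not the paper's exact formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_opl(query_emb, doc_emb, is_relevant):
    """Hypothetical orthogonal-projection-style loss for ONE (query, document) pair.

    Assumption: the target similarity is 1.0 for a relevant pair and 0.0 for an
    irrelevant pair, and the penalty is the squared deviation from that target.
    Nothing here depends on other examples in the batch.
    """
    target = 1.0 if is_relevant else 0.0
    return (cosine_similarity(query_emb, doc_emb) - target) ** 2


# Usage: each pair is scored on its own, so a batch of size one is enough.
aligned_positive = pairwise_opl([1.0, 0.0], [1.0, 0.0], is_relevant=True)
orthogonal_negative = pairwise_opl([1.0, 0.0], [0.0, 1.0], is_relevant=False)
aligned_negative = pairwise_opl([1.0, 0.0], [1.0, 0.0], is_relevant=False)
```

Because each pair's loss is computed independently, this kind of objective does not rely on in-batch negatives, which is what makes it compatible with the single-sample batches that 32K-token inputs force under GPU memory limits. A contrastive loss, by comparison, generally benefits from many negatives per batch.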

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, and 23 tables.

Figures (6)

  • Figure 1: Left: The LoCoV1 long document retrieval benchmark and the average document length of its constituent datasets. Center Left: M2-BERT sequence mixer. Center Right: The orthogonal projection loss. Right: Performance of various retrieval models and M2-BERT at different sequence lengths on LoCoV1. Circles are open models, where circle area corresponds to model size. X marks are closed models.
  • Figure 2: M2-BERT and Baseline Model Performance on Needle-in-the-Haystack Synthetic Task.
  • Figure 3: t-SNE Visualization of M2-BERT-32K Embeddings of RedPajama-V1 sample.
  • Figure 4: Cold vs. Warm Start for M2-BERT-32k Pretraining Checkpoints.
  • Figure 5: LoCoV1 Document Token Count Distributions.
  • ...and 1 more figure