MAIR: A Massive Benchmark for Evaluating Instructed Retrieval

Weiwei Sun, Zhengliang Shi, Jiulong Wu, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, Zhaochun Ren

TL;DR

MAIR addresses the problem of evaluating instruction following in information retrieval by introducing a large, heterogeneous benchmark of 126 tasks across 6 domains, annotated with 805 retrieval instructions in total. The benchmark is built through data collection, query sampling, and manual instruction annotation, and comprises 10,038 queries over 4,274,916 documents for testing instruction-tuned retrievers and re-rankers. The study shows that instruction-tuned text embedding models given instruction inputs generally outperform non-instruction-tuned counterparts, with GritLM-7B often leading in overall performance, though challenges remain in long-tail and complex-instruction tasks (as highlighted by the IFEval analysis). These findings underline the importance of instruction following in IR. MAIR is publicly available as a testbed for instruction-tuned retrieval models, and the paper also discusses current limitations and directions for future work, including multilingual extensions and prompt sensitivity.

Abstract

Recent information retrieval (IR) models are pre-trained and instruction-tuned on massive datasets and tasks, enabling them to perform well on a wide range of tasks and potentially generalize to unseen tasks with instructions. However, existing IR benchmarks focus on a limited scope of tasks, making them insufficient for evaluating the latest IR models. In this paper, we propose MAIR (Massive Instructed Retrieval Benchmark), a heterogeneous IR benchmark that includes 126 distinct IR tasks across 6 domains, collected from existing datasets. We benchmark state-of-the-art instruction-tuned text embedding models and re-ranking models. Our experiments reveal that instruction-tuned models generally achieve superior performance compared to non-instruction-tuned models on MAIR. Additionally, our results suggest that current instruction-tuned text embedding models and re-ranking models still lack effectiveness in specific long-tail tasks. MAIR is publicly available at https://github.com/sunnweiwei/Mair.

Paper Structure

This paper contains 27 sections, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Compared to other datasets, MAIR covers more diverse types of tasks. Bubble size represents the number of tasks of each type.
  • Figure 2: Visualization of the correlation among the 126 tasks in MAIR, with annotations for tasks from BEIR. MAIR includes more diverse tasks. Task similarity is determined from the performance correlation of all baseline models. We employ KMeans for clustering and t-SNE for visualization.
  • Figure 3: The performance correlation of baseline models for different numbers of sampled queries. Sampling 100 queries achieves a good trade-off between correlation and cost.
  • Figure 4: The number of tasks on which performance improves (green part) or drops (red part) when instructions are added. Instruction-tuned models improve on more tasks, while non-instruction-tuned models drop on most tasks.
  • Figure 5: Model scores on MTEB (Retrieval) versus MAIR.
  • ...and 2 more figures
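
The query-sampling check described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the correlation is a Pearson correlation between per-model mean scores on a query sample and on the full query set; the paper's exact procedure may differ. The scores are synthetic, and the helper names (`make_synthetic_scores`, `sampled_correlation`) are hypothetical, not taken from the MAIR codebase.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_synthetic_scores(num_models, num_queries, rng):
    """Hypothetical per-query scores in [0, 1]; NOT data from the paper."""
    scores = []
    for m in range(num_models):
        # Assumption: models differ only in their average quality.
        skill = 0.3 + 0.5 * m / (num_models - 1)
        row = [min(1.0, max(0.0, rng.gauss(skill, 0.25))) for _ in range(num_queries)]
        scores.append(row)
    return scores


def sampled_correlation(scores, sample_size, rng):
    """Correlate model means on a query sample with model means on all queries."""
    num_queries = len(scores[0])
    picked = rng.sample(range(num_queries), sample_size)
    full_means = [sum(row) / num_queries for row in scores]
    sample_means = [sum(row[q] for q in picked) / sample_size for row in scores]
    return pearson(full_means, sample_means)


rng = random.Random(0)
scores = make_synthetic_scores(num_models=8, num_queries=1000, rng=rng)
results = {n: sampled_correlation(scores, n, rng) for n in (10, 100, 1000)}
for n, r in results.items():
    print(f"{n:>5} sampled queries: correlation with full set = {r:.3f}")
```

A correlation close to 1 for a small sample means the sample ranks the models almost as the full query set does, which is the trade-off between correlation and evaluation cost that the caption refers to.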