MAIR: A Massive Benchmark for Evaluating Instructed Retrieval

Weiwei Sun; Zhengliang Shi; Jiulong Wu; Lingyong Yan; Xinyu Ma; Yiding Liu; Min Cao; Dawei Yin; Zhaochun Ren

TL;DR

MAIR addresses the problem of evaluating instruction following in information retrieval by introducing a massive, heterogeneous benchmark of 126 tasks across 6 domains, annotated with 805 retrieval instructions in total. It combines data collection, sampling, and manual instruction annotation to assemble 10,038 queries over 4,274,916 documents, enabling robust testing of instruction-tuned retrievers and rerankers. The study shows that instruction-tuned text embedding models given instruction inputs generally outperform non-instruction-tuned counterparts, with GritLM-7B often leading overall, though challenges remain on long-tail and complex-instruction tasks (as highlighted by the IFEval analysis). These findings underline the importance of instruction following in IR. MAIR is a publicly available, scalable testbed for advancing instruction-tuned retrieval models, and it exposes current limitations and directions for future work, including multilingual extensions and prompt sensitivity.

Abstract

Recent information retrieval (IR) models are pre-trained and instruction-tuned on massive datasets and tasks, enabling them to perform well on a wide range of tasks and potentially generalize to unseen tasks with instructions. However, existing IR benchmarks focus on a limited scope of tasks, making them insufficient for evaluating the latest IR models. In this paper, we propose MAIR (Massive Instructed Retrieval Benchmark), a heterogeneous IR benchmark that includes 126 distinct IR tasks across 6 domains, collected from existing datasets. We benchmark state-of-the-art instruction-tuned text embedding models and re-ranking models. Our experiments reveal that instruction-tuned models generally achieve superior performance compared to non-instruction-tuned models on MAIR. Additionally, our results suggest that current instruction-tuned text embedding models and re-ranking models still lack effectiveness in specific long-tail tasks. MAIR is publicly available at https://github.com/sunnweiwei/Mair.
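To make the idea of "instruction inputs" concrete, the following is a minimal sketch of how an instruction might be combined with a query before embedding, and how a ranking could then be scored. It is not the MAIR evaluation code. The bag-of-words `embed` function is a toy stand-in for a real text embedding model, the documents, query, and instruction are invented for illustration, and the choice of nDCG@10 as the metric is an assumption (this excerpt does not name one).

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a hypothetical stand-in for a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, docs, instruction=None):
    """Rank doc ids by similarity; the instruction, if given, is prepended to the query."""
    text = f"{instruction} {query}" if instruction else query
    q = embed(text)
    return sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)


def ndcg_at_k(ranking, relevant, k=10):
    """Binary-relevance nDCG@k (assumed metric, chosen for illustration)."""
    dcg = sum(
        1.0 / math.log2(i + 2) for i, d in enumerate(ranking[:k]) if d in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0


# Invented toy data: the instruction narrows the query to a document type.
docs = {
    "d1": "python tutorial for sorting lists",
    "d2": "legal case about software patent",
    "d3": "patent law statute text on software",
}
query = "software patent"
instruction = "find statute text"
relevant = {"d3"}

plain = rank(query, docs)
instructed = rank(query, docs, instruction)
print(plain, ndcg_at_k(plain, relevant))
print(instructed, ndcg_at_k(instructed, relevant))
```

In this toy setup the query alone ranks the legal case first, while the instructed query moves the statute to the top. A real evaluation would swap `embed` for an actual embedding model and repeat the comparison across tasks.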
