INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models

Hanseok Oh; Hyunji Lee; Seonghyeon Ye; Haebin Shin; Hansol Jang; Changwook Jun; Minjoon Seo

TL;DR

InstructIR presents a dedicated benchmark for instruction-following in information retrieval by generating instance-specific, user-aligned instructions and revised targets via GPT-4, enabling robust evaluation of retrieval models under diverse search scenarios. The dataset comprises 9,906 instances across 1,267 queries, with a novel Robustness score capturing stability under instruction variation. Experimental results reveal that task-style instruction tuning can underperform non-instruction-tuned baselines, while larger models and certain instruction-tuned systems (notably E5-mistral-7b-instruct) offer robustness gains, suggesting overfitting risks and the critical role of instruction type. The work highlights the need for broader, user-centric instruction data and potential RLHF approaches to align retrieval with real-world user intents, advancing practical instruction-aware search systems.
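The summary above does not state how the Robustness score is computed. As a minimal sketch, assume it is the worst-case value of a retrieval metric (for example nDCG@10) across the instruction variants of each query, averaged over queries. Both this definition and the names `robustness_score` and `instance_scores` are assumptions made for illustration, not taken from the text.

```python
from collections import defaultdict
from statistics import mean


def robustness_score(instance_scores):
    """Average, over queries, of the lowest per-instance score.

    instance_scores: iterable of (query_id, score) pairs, one pair per
    (query, instruction) instance. The score is any per-instance
    retrieval metric, assumed here to be something like nDCG@10.
    """
    per_query = defaultdict(list)
    for query_id, score in instance_scores:
        per_query[query_id].append(score)
    if not per_query:
        raise ValueError("no instances given")
    # Worst case per query, then the mean across queries.
    return mean(min(scores) for scores in per_query.values())


# Toy data: two queries, each with several instruction variants.
toy_scores = [
    ("q1", 1.0), ("q1", 0.5), ("q1", 0.75),
    ("q2", 1.0), ("q2", 0.75),
]
print(robustness_score(toy_scores))
```

Under this reading, a model scores well only if it retrieves the right target for every instruction attached to a query, so one failed variant pulls that query's contribution down to its worst case.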

Abstract

Despite the critical need to align search targets with users' intentions, retrievers often prioritize only the query, without considering the user's intended search context. Enhancing the capability of retrievers to understand users' intentions and preferences, akin to language model instructions, has the potential to yield more aligned search targets. Prior studies restrict the application of instructions in information retrieval to a task description format, neglecting the broader context of diverse and evolving search scenarios. Furthermore, the prevailing evaluation benchmarks are not explicitly tailored to assess instruction-following ability, which hinders progress in this field. In response to these limitations, we propose a novel benchmark, INSTRUCTIR, specifically designed to evaluate instruction-following ability in information retrieval tasks. Our approach focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios. Through experimental analysis, we observe that retrievers fine-tuned to follow task-style instructions, such as INSTRUCTOR, can underperform their non-instruction-tuned counterparts. This underscores potential overfitting issues inherent in constructing retrievers trained on existing instruction-aware retrieval datasets.

Paper Structure (35 sections, 10 figures, 8 tables)

Introduction
Related Works
Evaluation for Instruction Following.
Instruction Following in Information Retrieval.
The InstructIR Benchmark
Data Creation Pipeline
Step 1. Select Seed Examples.
Step 2. Generate Instructions.
Step 3. Revise Target Text.
Step 4. Filtering Process.
Dataset Analysis
Comparison Table.
Dataset Quality.
Dataset Diversity and Statistics.
Evaluation Metric
...and 20 more sections

Figures (10)

Figure 1: The InstructIR benchmark is designed to evaluate instruction-following ability in information retrieval tasks. As the user-aligned instruction changes, different search targets should be retrieved, reflecting real-world search scenarios.
Figure 2: Overview of the data creation pipeline for building the InstructIR benchmark. To build a dataset that demands diverse user-aligned instructions for each query, we begin by selecting seed examples from the MSMARCO datasets. We then generate a variety of instructions suitable for each query, revise the target text to align with these instructions, and systematically filter the generated content. The resulting dataset forms the InstructIR benchmark. GPT-4 is employed throughout this generation pipeline.
Figure 3: Prompt sensitivity per model. Blue and orange bars denote performance with the original instructions and the lowest score among the paraphrased versions, respectively.
Figure 4: Prompt for generating instructions (step 2)
Figure 5: Prompt for revising target text (step 3)
...and 5 more figures
