FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

Orion Weller; Benjamin Chang; Sean MacAvaney; Kyle Lo; Arman Cohan; Benjamin Van Durme; Dawn Lawrie; Luca Soldaini

TL;DR

FollowIR tackles the problem that IR models rarely follow detailed natural-language instructions. It introduces a benchmark built from TREC narratives and a paired-instruction evaluation framework, plus a training set to teach instruction-following. The results show that standard IR models struggle with long instructions unless they are large or instruction-tuned, whereas FollowIR-7B demonstrates substantial gains after fine-tuning. The work also provides an open dataset and an instruction-following model to spur development of more capable, instruction-aware IR systems.

Abstract

Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions -- also known as narratives -- developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. This process lets us measure how well IR models follow instructions, using a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them only as a source of basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model improves significantly after fine-tuning on our training set.

Paper Structure (31 sections, 1 equation, 4 figures, 5 tables)

Introduction
Related Work
TREC Conferences
Instructions for LMs
Instructions for Retrieval
Building FollowIR
Evaluation Metrics for FollowIR
Evaluating Instruction Following
Evaluation Settings
No Instructions in Training
Instructions in IR Training
API Models
Instruction-Tuned LMs
Results
No-Instruction IR Models
...and 16 more sections

Figures (4)

Figure 1: How do standard retrieval queries differ from instructions (or narratives)? Instructions contain more specific details about what is relevant, include less directly-relevant background information, and often have directives about what documents are not relevant, using negation. %'s are how often a certain type of content appears in the original TREC instructions used in FollowIR.
Figure 2: A visual depiction of the pairwise evaluation framework: models are evaluated on the query with the original instruction, and then on the query with the altered instruction. If the model correctly understands the instructions, it will change which documents are relevant w.r.t. the alteration (right). Note that the real-world instructions (left) given to TREC annotators include fine-grained details about what relevance is, as well as instructions containing negation (in bold).
Figure 3: Score difference between using no instructions and using instructions formatted as keywords, short text, or the full text. While models that can correctly use instructions see gains with the additional information, most other models see decreasing performance as instruction length increases.
Figure 4: Performance on the InstructIR benchmark using their "Robustness@10" scores, i.e. the min nDCG@10 score across 10 instructions. Upper portion is bi-encoders while lower is rerankers.
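
The Figure 4 caption describes "Robustness@10" as the minimum nDCG@10 score across 10 instructions. A minimal sketch of that aggregation is given below, under several assumptions: the function names (`dcg_at_k`, `ndcg_at_k`, `robustness_at_k`), the input layout (one ranking and one set of relevance labels per instruction), and the linear-gain form of DCG are illustrative choices, not taken from the paper or from the InstructIR code.

```python
import math
from typing import Dict, List, Tuple


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one ranked list of document ids against graded relevance labels."""
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranking]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def robustness_at_k(
    runs: List[Tuple[List[str], Dict[str, float]]], k: int = 10
) -> float:
    """Worst-case nDCG@k over a group of instructions.

    Assumption: each element of `runs` pairs the ranking produced under one
    instruction with the relevance labels that apply to that instruction.
    """
    return min(ndcg_at_k(ranking, qrels, k) for ranking, qrels in runs)


if __name__ == "__main__":
    # Hypothetical toy data: two instructions for the same query.
    runs = [
        (["a", "b"], {"a": 1.0}),  # relevant document ranked first
        (["b", "a"], {"a": 1.0}),  # relevant document ranked second
    ]
    print(robustness_at_k(runs))
```

Taking the minimum rather than the mean means a model scores well only if it ranks well under every instruction in the group, so a single instruction that the model ignores pulls the whole score down.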
