Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models

Jianqun Zhou, Yuanlei Zheng, Wei Chen, Qianqian Zheng, Hui Su, Wei Zhang, Rui Meng, Xiaoyu Shen

TL;DR

The paper tackles the gap between advancing instruction-following in LLMs and retrieval systems by introducing InfoSearch, a benchmark that evaluates retrieval models on six document-level dimensions (Audience, Keyword, Format, Language, Length, Source) across Original, Instructed, and Reversely Instructed modes. It adds two metrics, Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE), to capture strict and graded instruction adherence beyond content relevance. Across 15 models, including sparse, dense, and reranking approaches, results show that list-wise reranking and large instruction-tuned models achieve the best instruction-following performance, with GPT-4o leading in WISE and SICR, yet overall compliance remains imperfect. The work provides a practical framework for evaluating instruction-following in retrieval, guiding future development toward more instruction-responsive search systems and better alignment with user-specific document attributes.

Abstract

Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances: most of them still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance, which neglects the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce two novel metrics, Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE), to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance.

Paper Structure

This paper contains 28 sections, 9 equations, 5 figures, 22 tables.

Figures (5)

  • Figure 1: InfoSearch consists of six dimensions, each representing a document-level feature with values drawn from predefined conditions. Queries are paired with one dimension and evaluated in three retrieval modes based on the given instructions.
  • Figure 2: Overview of the dataset construction process for InfoSearch.
  • Figure 3: Radar plots comparing the WISE scores of various models across different dimensions, highlighting the strengths and weaknesses of each model in handling different types of instructions. Among retrieval models, GritLM demonstrates the strongest instruction-following capability, while GPT-4 consistently performs the best across all dimensions in the reranking category.
  • Figure 4: Comparison of the InfoSearch dataset with FollowIR and InstructIR in terms of data distribution across six dimensions.
  • Figure 5: Heatmap of the rewards component