mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval

Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Sam Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini, Benjamin Van Durme, Dawn Lawrie

TL;DR

The paper addresses the problem of evaluating instruction-following in retrieval across languages. It introduces mFollowIR, a multilingual benchmark built on NeuCLIR narratives in Russian, Chinese, and Persian, and uses controlled narrative edits to isolate instruction-following in a reranking setup, measured with $nDCG@20$ and $p$-MRR. It shows that retrievers trained on English instructions transfer well to the cross-lingual setting, but that multilingual performance remains weaker, especially for bi-encoders, while cross-encoders and instruction-tuned models show the strongest instruction-following signals. The work contributes a new multilingual dataset, a rigorous evaluation framework, and insights into how instruction data and model scale influence multilingual retrieval, highlighting directions for future improvement in multilingual instruction-following capabilities.

Abstract

Retrieval systems generally focus on web-style queries that are short and underspecified. However, advances in language models have facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. These efforts have focused exclusively on English; therefore, we do not yet understand how they work across languages. We introduce mFollowIR, a multilingual benchmark for measuring instruction-following ability in retrieval models. mFollowIR builds upon the TREC NeuCLIR narratives (or instructions) that span three diverse languages (Russian, Chinese, Persian), giving both query and instruction to the retrieval models. We make small changes to the narratives and isolate how well retrieval models can follow these nuanced changes. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting, indicating that more work is needed in developing data for instruction-based multilingual retrievers.

Paper Structure

This paper contains 33 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: A visual depiction of the pairwise evaluation framework using p-MRR. Left: the original instruction (narrative) is changed to be more specific, making some previously relevant documents newly non-relevant (e.g. Doc A). Right: the model is then evaluated on both the original instruction and the changed instruction (along with the query for both). Relevant documents are in blue, non-relevant documents are in red; note that Doc A is relevant for the original instruction but not for the changed one. p-MRR measures whether these newly non-relevant documents decreased in rank (in this case, correctly going from Rank 1 to Rank 3). If the newly non-relevant documents correctly decrease in rank, p-MRR is positive (up to 1.0); if their rank increases, it is negative (down to -1.0); and if there is no change, the score is 0 (see the paper's evaluation section).
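
The per-document behaviour described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the formula below is an assumption, chosen because it reproduces the properties the caption states (positive when a newly non-relevant document drops in rank, negative when it rises, 0 when unchanged, bounded by -1.0 and 1.0). The function names `p_mrr` and `mean_p_mrr` are hypothetical, and the simple averaging is likewise an assumption; the paper's evaluation section gives the actual definition.

```python
def p_mrr(rank_original: int, rank_changed: int) -> float:
    """Pairwise score for one newly non-relevant document (1-based ranks).

    Assumed formula (not taken from this page): compare the reciprocal
    ranks of the document under the original and the changed instruction.
    """
    if rank_original < 1 or rank_changed < 1:
        raise ValueError("ranks must be positive, 1-based integers")
    mrr_original = 1.0 / rank_original
    mrr_changed = 1.0 / rank_changed
    if rank_original > rank_changed:
        # The document moved UP the ranking after the change: negative score.
        return mrr_original / mrr_changed - 1.0
    # The document moved DOWN (or did not move): positive score, 0.0 if unchanged.
    return 1.0 - mrr_changed / mrr_original


def mean_p_mrr(rank_pairs):
    """Average the per-document scores (assumed aggregation, for illustration)."""
    scores = [p_mrr(original, changed) for original, changed in rank_pairs]
    return sum(scores) / len(scores)


# Doc A from the caption: Rank 1 under the original instruction, Rank 3 after.
doc_a_score = p_mrr(1, 3)
```

Under this assumed formula, Doc A's move from Rank 1 to Rank 3 gives a score of 1 - 1/3, and the reverse move (Rank 3 to Rank 1) gives the same magnitude with a negative sign, which matches the symmetry the caption describes.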