Language Fairness in Multilingual Information Retrieval

Eugene Yang; Thomas Jänich; James Mayfield; Dawn Lawrie

TL;DR

The paper addresses fairness in multilingual information retrieval by introducing PEER, a language-aware fairness metric based on the Kruskal-Wallis test. PEER evaluates whether documents in different languages with the same relevance level tend to occupy similar ranks, avoiding reliance on a predefined target distribution. By computing per-relevance-level $p$-values and aggregating them with weights into $\text{PEER}^{(q)}$ and $\text{PEER}@X$, the authors provide a robust, non-parametric measure that handles rank cutoffs and untranslated/unretrieved documents. Empirical results on synthetic data, language-labeling experiments, and real MLIR benchmarks (CLEF 2003 and NeuCLIR 2022) show that PEER aligns with fairness intuitions and often distinguishes language bias better than prior metrics such as AWRF or $\alpha$-nDCG, highlighting its practical value for MLIR evaluation and system development.

Abstract

Multilingual information retrieval (MLIR) considers the problem of ranking documents in several languages for a query expressed in a language that may differ from any of those languages. Recent work has observed that approaches such as combining ranked lists that each represent a single document language, or using multilingual pretrained language models, demonstrate a preference for one language over others. This results in systematically unfair treatment of documents in different languages. This work proposes a language fairness metric that evaluates whether documents across different languages are fairly ranked, using statistical equivalence testing with the Kruskal-Wallis test. In contrast to most prior work in group fairness, we do not consider any language to be an unprotected group. Thus our proposed measure, PEER (Probability of Equal Expected Rank), is the first fairness metric specifically designed to capture the language fairness of MLIR systems. We demonstrate the behavior of PEER on artificial ranked lists. We also evaluate real MLIR systems on two publicly available benchmarks and show that the PEER scores align with prior analytical findings on MLIR fairness. Our implementation is compatible with ir-measures and is available at http://github.com/hltcoe/peer_measure.
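The computation described in the TL;DR and abstract (a Kruskal-Wallis test per relevance level, followed by a weighted aggregation of the resulting p-values) can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' implementation and not the API of the peer_measure repository. The names `chi2_sf`, `kruskal_wallis_p`, and `peer_sketch` are hypothetical. Three behaviors are assumptions made for the sketch: documents not retrieved within the cutoff share the tied rank `cutoff + 1`, a relevance level with fewer than two languages (or with all ranks tied) contributes a p-value of 1, and the weights are normalized to sum to 1.

```python
import math
from collections import defaultdict


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Start from the closed form for df = 1 (odd) or df = 2 (even), then apply
    # the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def kruskal_wallis_p(groups):
    """p-value of the Kruskal-Wallis H test (chi-square approximation, tie-corrected)."""
    groups = [g for g in groups if g]
    n = sum(len(g) for g in groups)
    if len(groups) < 2:
        return 1.0  # assumption: nothing to compare means no evidence of bias
    values = sorted(v for g in groups for v in g)
    # Mid-rank for each distinct value, plus the tie-correction term.
    midrank, ties, i = {}, 0.0, 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        midrank[values[i]] = (i + 1 + j) / 2.0
        ties += (j - i) ** 3 - (j - i)
        i = j
    if ties == n ** 3 - n:
        return 1.0  # assumption: all ranks tied means no evidence of bias
    h = 12.0 / (n * (n + 1)) * sum(
        sum(midrank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3.0 * (n + 1)
    h /= 1.0 - ties / (n ** 3 - n)
    return chi2_sf(h, len(groups) - 1)


def peer_sketch(ranking, lang, rel, weights, cutoff):
    """Weighted average of per-relevance-level Kruskal-Wallis p-values.

    ranking: doc ids in ranked order
    lang, rel: dicts mapping judged doc ids to language and relevance level
    weights: dict mapping relevance level to a non-negative weight
    cutoff: rank cutoff X
    """
    position = {d: r for r, d in enumerate(ranking[:cutoff], start=1)}
    languages = sorted(set(lang.values()))
    total = score = 0.0
    for level, w in weights.items():
        by_lang = defaultdict(list)
        for d, r in rel.items():
            if r == level:
                # Assumption: unretrieved docs share the tied rank cutoff + 1.
                by_lang[lang[d]].append(position.get(d, cutoff + 1))
        score += w * kruskal_wallis_p([by_lang[l] for l in languages])
        total += w
    return score / total if total else 1.0


# Two languages, binary relevance, eight relevant documents.
docs = [f"{l}{i}" for i in range(4) for l in ("en", "zh")]  # alternating languages
lang = {d: d[:2] for d in docs}
rel = {d: 1 for d in docs}
fair = peer_sketch(docs, lang, rel, {1: 1.0}, cutoff=8)
unfair = peer_sketch(sorted(docs), lang, rel, {1: 1.0}, cutoff=8)  # all "en" first
```

In this toy example the interleaved list receives a much higher score than the list that places every document of one language ahead of the other, which matches the intuition that a higher probability of equal expected rank indicates fairer treatment of languages.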

Paper Structure (11 sections, 3 equations, 1 figure, 1 table)

Introduction
Related Work
Probability of Equal Expected Rank
Fairness through Hypothesis Testing
Fairness at Each Relevance Level
Rank Cutoff and Aggregation
Experiments and Results
Synthetic Data
Assigning Languages to a Real Ranked List
Real MLIR Systems
Summary

Figures (1)

Figure 1: Ranked lists with different fairness patterns between two languages and binary relevance.
