T-Retrievability: A Topic-Focused Approach to Measure Fair Document Exposure in Information Retrieval

Xuejun Chang, Zaiqiao Meng, Debasis Ganguly

TL;DR

The paper addresses exposure fairness in information retrieval by arguing that traditional collection-level retrievability conflates topical relevance priors with document accessibility. It introduces Topical-Retrievability (T-Retrievability), a localized measure that computes retrievability over groups of topically related queries and aggregates the resulting scores into a collection-level statistic using Gini-based exposure fairness. The method replaces the cut-off-dependent retrievability with a rank-based formulation $r(D, \mathcal{C}, \mathcal{Q}, \theta) = \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \frac{1}{\log(1+\rho(D;Q, \theta))}$, where $\rho(D;Q, \theta)$ is the rank of document $D$ for query $Q$ under retrieval model $\theta$. It uses real user queries from MS MARCO dev, grouped into $K$ topical clusters $\mathcal{Q}_1, \dots, \mathcal{Q}_K$ via K-means on both sparse (TF-IDF) and dense (SBERT) representations. By computing $r(D, \mathcal{C}, \mathcal{Q}_i, \theta)$ for each topic, deriving per-topic Gini coefficients, and aggregating them with min/avg/max, the paper demonstrates that localized analysis reveals exposure fairness patterns that collection-level measures miss. Experiments with BM25, SPLADE, TCT-ColBERT, and reranked variants on MS MARCO show substantial variation in exposure fairness across models and topic granularities, underscoring the value of topic-focused auditing for fair document exposure in IR systems.
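
The rank-based formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-based input format, the use of the natural logarithm, and the choice to give a zero contribution to documents that a query does not retrieve are all assumptions made for the example.

```python
import math


def rank_based_retrievability(rankings, doc_ids):
    """Average 1 / log(1 + rank) of each document over a set of queries.

    rankings: dict mapping a query id to its ranked list of doc ids (rank 1 first).
    doc_ids:  the documents to score.
    Assumption: a document absent from a ranking contributes 0 for that query.
    Assumption: natural logarithm (the summary does not state the base).
    """
    if not rankings:
        raise ValueError("need at least one query")
    totals = {doc: 0.0 for doc in doc_ids}
    for ranked_docs in rankings.values():
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in totals:
                totals[doc] += 1.0 / math.log(1 + rank)
    # Normalise by the number of queries, i.e. the 1/|Q| factor.
    return {doc: total / len(rankings) for doc, total in totals.items()}


# Hypothetical toy rankings: two queries over a four-document collection.
rankings = {"q1": ["d1", "d2"], "q2": ["d1", "d3"]}
scores = rank_based_retrievability(rankings, ["d1", "d2", "d3", "d4"])
```

In this toy input, `d1` is ranked first for both queries and so receives the highest score, while `d4` is never retrieved and scores zero. Restricting `rankings` to the queries of a single cluster $\mathcal{Q}_i$ would give the per-topic score $r(D, \mathcal{C}, \mathcal{Q}_i, \theta)$.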

Abstract

Retrievability of a document is a collection-based statistic that measures its expected (reciprocal) rank of being retrieved within a specific rank cut-off. A collection with uniformly distributed retrievability scores across documents is an indicator of fair document exposure. While retrievability scores have been used to quantify the fairness of exposure for a collection, in our work, we use the distribution of retrievability scores to measure the exposure bias of retrieval models. We hypothesise that an uneven distribution of retrievability scores across the entire collection may not accurately reflect exposure bias but rather indicate variations in topical relevance. As a solution, we propose a topic-focused localised retrievability measure, which we call T-Retrievability (topic-retrievability). It first computes retrievability scores over multiple groups of topically-related documents, and then aggregates these localised values to obtain the collection-level statistics. Our analysis using the proposed T-Retrievability measure uncovers new insights into the exposure characteristics of various neural ranking models. The findings suggest that this localised measure provides a more nuanced understanding of exposure fairness, offering a more reliable approach for assessing document accessibility in IR systems.
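
The hypothesis that collection-wide unevenness can reflect topical variation rather than exposure bias can be illustrated with a small Python sketch. It uses the standard sorted-values form of the Gini coefficient and a min/avg/max summary over topics. The helper names and all numbers are hypothetical, hand-picked toy values (not results from the paper), and scoring each topic only over its own documents is an assumption of this example.

```python
from statistics import mean


def gini(values):
    """Gini coefficient of non-negative scores (0 = perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over values sorted in ascending order.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def aggregate_topic_gini(topic_scores):
    """topic_scores: dict mapping a topic id to its per-document score list.

    Returns the per-topic Gini values and their min/avg/max summary.
    """
    per_topic = {topic: gini(s) for topic, s in topic_scores.items()}
    vals = list(per_topic.values())
    summary = {"min": min(vals), "avg": mean(vals), "max": max(vals)}
    return per_topic, summary


# Toy values: topic A's documents are retrieved twice as often as topic B's
# because topic A attracts more queries, yet each topic is even internally.
collection_scores = [0.4, 0.4, 0.2, 0.2]
topic_scores = {"A": [0.6, 0.6], "B": [0.3, 0.3]}

collection_gini = gini(collection_scores)
per_topic, summary = aggregate_topic_gini(topic_scores)
```

Here the collection-level Gini is non-zero (1/6) even though every topic is perfectly even internally, so the collection-level number reflects the difference in topic popularity rather than unfair exposure within a topic.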

Paper Structure

This paper contains 15 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Variations in document exposure fairness (as measured by Gini coefficients of T-Retrievability) for different granularities of topics (query groups) obtained with K-means on dense representations of the queries.
  • Figure 2: Similar to Figure 1, except that a sparse (TF-IDF) representation is used to cluster the queries.