Recall, Robustness, and Lexicographic Evaluation

Fernando Diaz, Michael D. Ekstrand, Bhaskar Mitra

TL;DR

This work formalizes recall in rankings via recall-orientation, robustness, and lexicographic evaluation, introducing Total Search Efficiency (TSE) as a principled recall-focused metric and Lexicographic Recall (LR) to address ties and improve discriminative power. It connects recall to robustness by showing that the worst-case user or provider performance corresponds to TSE, aligning with Rawlsian fairness concepts. The empirical analysis across three recommendation tasks and 17 information retrieval tasks demonstrates that LR preserves correlation with existing recall metrics while offering greater sensitivity and stability under missing labels. The findings advocate adopting recall-oriented, robustness-aware evaluation (via TSE and LR) to improve fairness, reliability, and interpretability in ranking systems. The approach provides practical guidance for data labeling, depth of evaluation, and potential algorithmic directions toward stochastic ranking mechanisms.

Abstract

Although originally developed to evaluate sets of items, recall is often used to evaluate rankings of items, including those produced by recommender, retrieval, and other machine learning systems. The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure. In light of this debate, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define 'recall-orientation' as the sensitivity of a metric to a user interested in finding every relevant item. Second, we analyze recall-orientation from the perspective of robustness with respect to possible content consumers and providers, connecting recall to recent conversations about fair ranking. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across three recommendation tasks and 17 information retrieval tasks, we establish that our new evaluation method, lexirecall, has convergent validity (i.e., it is correlated with existing recall metrics) and exhibits substantially higher sensitivity in terms of discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.

Paper Structure (55 sections, 21 theorems, 52 equations, 19 figures, 8 tables)

This paper contains 55 sections, 21 theorems, 52 equations, 19 figures, 8 tables.

Key Result

theorem 1

$\mu$ is monotonically increasing in retrieval size.

Figures (19)

  • Figure 1: Percentage of full papers published at the ACM Conference on Recommender Systems that measure recall in experiments. Percentages are macro-averaged across years to control for growth in the number of full papers in the conference proceedings. Details can be found in Appendix \ref{app:recsys-recall}.
  • Figure 2: Both recall and robustness, as theoretical constructs, can be conceptualized in multiple ways \cite{zobel:against-recall, drenkow:robustness}. These two constructs are equivalent when conceptualized as totality (recall) and worst-case analysis (robustness). Lexicographic recall is located in this intersection.
  • Figure 3: Relevance Projection with Imputation. Given a set of relevant item ids $\mathcal{R}=\{083,107,511,668,820,977\}$, relevance projection maps an incomplete top-10 ranking $\pi$ and the relevant set $\mathcal{R}$ to an $m\times 1$ vector of positions $p$. We also show the inverse projection vector $\overline{p}$ of items at specific recall levels.
  • Figure 4: Metric orientation. Each ranking $\pi$ of ten items is represented as a vector of cells ordered from top to bottom, with shaded cells representing relevant items $\mathcal{R}$. Precision orientation (left) measures the degradation in a metric when the highest ranked relevant item is moved to the bottom of the ranking while holding all other positions fixed. Recall orientation (right) measures the degradation in a metric when the lowest ranked relevant item is moved to the bottom of the ranking while holding all other positions fixed. We measure the precision and recall orientation of a metric $\mu$ by the difference $\mu(\pi,\mathcal{R})-\mu(\pi',\mathcal{R})$, where $\pi'$ is the modified ranking.
  • Figure 5: Metric orientation of ranking metrics of $10^5$ items with $m\in[1\,..\,15]$ relevant items. The vertical axis reflects, for $m\in[1,15]$, the change in metric value when (precision orientation) moving the top-ranked item from position 1 to position $n-m+1$ or (recall orientation) moving the bottom-ranked item from position $m$ to position $n$. The values for TSE (Equation \ref{eq:tse}) are scaled by the lower and upper bound given a fixed $m$ and therefore apply to any exposure model. See Figure 4 for details. This figure is best viewed in color.
  • ...and 14 more figures
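
The relevance projection of Figure 3 and the orientation measurements of Figure 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the example ranking, the corpus size `n`, and the choice to impute unretrieved relevant items at the very bottom of the corpus are all assumptions made for illustration; the captions do not specify the imputation scheme. Reciprocal rank and recall@k stand in for a generic metric $\mu$ defined on the position vector $p$.

```python
from typing import Callable, List, Sequence, Set


def relevance_projection(ranking: Sequence[str], relevant: Set[str], n: int) -> List[int]:
    """Return the sorted 1-based positions p of the relevant items.

    Assumption (hypothetical scheme): relevant items missing from the
    ranking are imputed at the last positions of a corpus of n items.
    """
    found = sorted(i + 1 for i, item in enumerate(ranking) if item in relevant)
    missing = len(relevant) - len(found)
    imputed = list(range(n - missing + 1, n + 1))
    return found + imputed


def reciprocal_rank(p: List[int]) -> float:
    """Depends only on the highest ranked relevant item."""
    return 1.0 / p[0]


def recall_at_k(p: List[int], k: int) -> float:
    """Fraction of relevant items at or above position k."""
    return sum(1 for pos in p if pos <= k) / len(p)


def precision_orientation(metric: Callable[[List[int]], float], p: List[int], n: int) -> float:
    """Metric degradation when the highest ranked relevant item moves to position n.

    Assumes position n is not already occupied by a relevant item.
    """
    moved = sorted(p[1:] + [n])
    return metric(p) - metric(moved)


def recall_orientation(metric: Callable[[List[int]], float], p: List[int], n: int) -> float:
    """Metric degradation when the lowest ranked relevant item moves to position n.

    Assumes position n is not already occupied by a relevant item.
    """
    moved = sorted(p[:-1] + [n])
    return metric(p) - metric(moved)


# Relevant set from the Figure 3 caption; the top-10 ranking is made up.
relevant = {"083", "107", "511", "668", "820", "977"}
ranking = ["107", "214", "977", "305", "412", "083", "550", "631", "742", "899"]
p = relevance_projection(ranking, relevant, n=1000)

# A ten-item ranking with relevant items at positions 1, 3, and 5.
toy = [1, 3, 5]
rr_precision = precision_orientation(reciprocal_rank, toy, n=10)
rr_recall = recall_orientation(reciprocal_rank, toy, n=10)
```

In this toy case reciprocal rank changes only under the precision-orientation move and is unaffected by the recall-orientation move, which is the kind of contrast the orientation measurements are meant to expose.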

Theorems & Definitions (44)

  • definition 1
  • definition 2
  • theorem 1
  • proof
  • theorem 2
  • proof
  • theorem 3
  • proof
  • theorem 4
  • proof
  • ...and 34 more