Pessimistic Evaluation

Fernando Diaz

TL;DR

We argue that evaluating only with average metric measurements assumes utilitarian values not aligned with traditions of information access based on equal access, and we advocate for pessimistic evaluation of information access systems focusing on worst case utility.

Abstract

Traditional evaluation of information access systems has focused primarily on average utility across a set of information needs (information retrieval) or users (recommender systems). In this work, we argue that evaluating only with average metric measurements assumes utilitarian values not aligned with traditions of information access based on equal access. We advocate for pessimistic evaluation of information access systems focusing on worst case utility. These methods are (a) grounded in ethical and pragmatic concepts, (b) theoretically complementary to existing robustness and fairness methods, and (c) empirically validated across a set of retrieval and recommendation tasks. These results suggest that pessimistic evaluation should be included in existing experimentation processes to better understand the behavior of systems, especially when concerned with principles of social good.

Paper Structure

This paper contains 33 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Ranking of runs in three TREC tracks (see Table \ref{tab:data}) according to associated metrics when ordering systems by average performance (horizontal) and leximin (vertical). $\diamond$: change in position less than one quintile. $\circ$: change in position between one and two quintiles. $\bullet$: change in position greater than two quintiles. Red: degradation in ranking. Blue: improvement in rank position. Black: no change in rank position.
  • Figure 2: Kendall's $\tau$ between system orderings by smoothed leximin and arithmetic mean (solid) and leximin (dashed) using average precision on Robust 2004 (see Section \ref{sec:experiments:data} for details).
  • Figure 3: Discretization of average precision values for Robust 2004. (a) Metric values below the threshold set to zero. (b) Metric values quantized to a specific number of significant digits.
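
The captions above contrast system orderings by arithmetic mean with orderings by leximin. The following is a minimal sketch of that contrast, not the paper's implementation. It assumes the standard social-choice reading of leximin: sort each system's per-topic scores in ascending order and compare the sorted lists lexicographically, so the worst topic decides first. The system names and score values are hypothetical, and every system is assumed to be scored on the same set of topics.

```python
from statistics import mean

# Hypothetical per-topic metric values (e.g. average precision) for three systems.
# These numbers are illustrative only and are not taken from the paper.
scores = {
    "sys_a": [0.9, 0.8, 0.0],
    "sys_b": [0.6, 0.5, 0.4],
    "sys_c": [0.6, 0.5, 0.3],
}


def rank_by_mean(per_topic):
    """Order systems from best to worst by arithmetic mean."""
    return sorted(per_topic, key=lambda s: mean(per_topic[s]), reverse=True)


def leximin_key(values):
    """Ascending-sorted scores as a tuple.

    Tuples compare lexicographically, so the worst topic is compared first,
    then the second worst, and so on.
    """
    return tuple(sorted(values))


def rank_by_leximin(per_topic):
    """Order systems from best to worst by leximin (assumed definition)."""
    return sorted(per_topic, key=lambda s: leximin_key(per_topic[s]), reverse=True)


mean_order = rank_by_mean(scores)
leximin_order = rank_by_leximin(scores)
print("mean:   ", mean_order)
print("leximin:", leximin_order)
```

In this toy data, `sys_a` has the highest mean but fails one topic completely, so it ranks first by mean and last by leximin. Rank changes of this kind are what Figure 1 is described as plotting.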