How Much Freedom Does An Effectiveness Metric Really Have?

Alistair Moffat; Joel Mackenzie

TL;DR

This paper examines whether innate pairwise SERP orderings (IPSO) constrain metric-based score orderings, and proposes a metric-agnostic sign-test corroboration for system-vs-system comparisons. The study formalizes two universal ordering rules for SERP pairs and demonstrates that a substantial fraction of $k$-SERP pairs exhibit fixed innate orderings, enabling cross-metric generalization. It introduces IPSO as a practical framework for checking whether a conclusion reached with a single, user-model-aligned metric is likely to hold across other metrics, and validates the approach on the 2004 TREC Robust data with a concrete reporting workflow and a discussion of limitations. Together, these results offer a principled path to robust IR evaluation by hedging metric-specific conclusions with metric-agnostic corroboration.

Abstract

It is tempting to assume that because effectiveness metrics have free choice to assign scores to search engine result pages (SERPs) there must thus be a similar degree of freedom as to the relative order that SERP pairs can be put into. In fact that second freedom is, to a considerable degree, illusory. That's because if one SERP in a pair has been given a certain score by a metric, fundamental ordering constraints in many cases then dictate that the score for the second SERP must be either not less than, or not greater than, the score assigned to the first SERP. We refer to these fixed relationships as innate pairwise SERP orderings. Our first goal in this work is to describe and defend those pairwise SERP relationship constraints, and tabulate their relative occurrence via both exhaustive and empirical experimentation. We then consider how to employ such innate pairwise relationships in IR experiments, leading to a proposal for a new measurement paradigm. Specifically, we argue that tables of results in which many different metrics are listed for champion versus challenger system comparisons should be avoided; and that instead a single metric be argued for in principled terms, with any relationships identified by that metric then reinforced via an assessment of the innate relationship as to whether other metrics - indeed, all other metrics - are likely to yield the same system-vs-system outcome.
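The notion of an innate pairwise ordering can be illustrated with a minimal sketch. The abstract does not state the ordering constraints, so the rule used here is an assumption: with binary relevance, one SERP is treated as non-inferior to another when it has at least as many relevant documents in every prefix of the ranking. The names `prefix_counts`, `innate_order`, and `count_non_separable` are hypothetical and not taken from the paper.

```python
from itertools import accumulate, product


def prefix_counts(serp):
    """Running count of relevant documents at each depth of a binary SERP."""
    return list(accumulate(serp))


def innate_order(s1, s2):
    """Compare two equal-length binary SERPs under the ASSUMED dominance rule.

    Returns "=" if the SERPs are identical, ">=" if s1 cannot score lower
    than s2, "<=" if s1 cannot score higher than s2, and "?" if the pair is
    non-separable (a metric would be free to order them either way).
    """
    if len(s1) != len(s2):
        raise ValueError("SERPs must be evaluated to the same depth")
    c1, c2 = prefix_counts(s1), prefix_counts(s2)
    s1_dominates = all(a >= b for a, b in zip(c1, c2))
    s2_dominates = all(a <= b for a, b in zip(c1, c2))
    if s1_dominates and s2_dominates:
        return "="
    if s1_dominates:
        return ">="
    if s2_dominates:
        return "<="
    return "?"


def count_non_separable(k):
    """Count ordered SERP pairs of length k that have no innate ordering."""
    serps = list(product((0, 1), repeat=k))
    return sum(innate_order(a, b) == "?" for a in serps for b in serps)
```

Under this assumed rule, `innate_order((1, 0, 0), (0, 1, 1))` returns `"?"`: the first SERP leads at depth 1 and the second leads at depth 3, so neither dominates. Every other pair of distinct length-3 SERPs receives a fixed ordering.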

Paper Structure (5 sections, 8 figures, 3 tables, 1 algorithm)

Introduction
Innate Pairwise SERP Orderings
A New Perspective -- IPSO
Related Work
Conclusion

Figures (8)

Figure 1: Hasse diagram illustrating the innate non-inferiority relationships amongst SERPs of length $k=3$.
Figure 2: All SERP pairs of length $k=3$, and the relationships between them. Red cells indicate non-separability.
Figure 3: All SERP pairs of length $k=6$, plotted using the same colors as in Figure 2, with each grid consisting of $2^6\times 2^6=4{,}096$ cells. Each pane has exactly the same number of red cells; they indicate SERP pairs where the two metrics are permitted to disagree on the relative ordering of the two SERPs, and are not in any way an indication that they do disagree.
Figure 4: Comparing SERPs to get an innate pairwise ordering. In this example, either S1 or S2 might be assigned the higher score by metrics computed to depth $k=10$, which is possible because the two SERPs are non-separable at that depth. But at depths $k=4$ through $k=8$, SERP S1 cannot be assigned a lower effectiveness score than SERP S2.
Figure 5: Comparison to depth $k=10$ between two Robust runs over Topics $301$--$325$. In the final six columns dark blue entries indicate equality of metric scores; yellow entries indicate that System $A$ scores more highly; and light blue values indicate that System $B$ scores more highly. The values in those cells are the score differences, $M(A)-M(B)$.
...and 3 more figures
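
The per-topic comparison described for Figure 5, combined with the sign-test corroboration mentioned in the TL;DR, might be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: each topic is assumed to have already been labelled with a hypothetical outcome code (`"A"` when System A's SERP is innately non-inferior, `"B"` for the reverse, `"="` for identical SERPs, `"?"` for non-separable pairs), and only the `"A"` and `"B"` topics enter an exact two-sided sign test.

```python
from collections import Counter
from math import comb


def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test p-value under a fair-coin null hypothesis."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive topics, so no evidence either way
    smaller = min(wins_a, wins_b)
    # Probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2**n
    return min(1.0, 2 * tail)


def corroborate(outcomes):
    """Tally per-topic outcome codes and test the decisive ones.

    `outcomes` holds one code per topic; the codes "A", "B", "=", "?" are
    hypothetical labels chosen for this sketch.
    """
    tally = Counter(outcomes)
    p_value = sign_test_p(tally["A"], tally["B"])
    return tally, p_value
```

For example, `corroborate(["A", "A", "?", "A", "=", "A", "A"])` counts five topics favoring System A and none favoring System B, with the tied and non-separable topics left out of the test.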
