Replicability Measures for Longitudinal Information Retrieval Evaluation

Jüri Keller; Timo Breuer; Philipp Schaer

TL;DR

The most effective systems are not necessarily the ones whose performance persists best: system rankings vary across retrieval measures and over time.

Abstract

Information retrieval (IR) systems are exposed to constant change in most of their components. Documents are created, updated, or deleted, information needs change, and even relevance might not be static. While IR systems are generally expected to retain consistent utility for their users, test collection evaluations rely on a fixed experimental setup. Based on the LongEval shared task and test collection, this work explores how effectiveness measured in evolving experiments can be assessed. Specifically, the persistency of effectiveness is investigated as a replicability task. Effectiveness is observed to deteriorate progressively over time compared to the initial measurement. Adapted replicability measures provide further insight into the persistence of effectiveness: the ranking of systems varies across retrieval measures and time. In conclusion, the most effective systems are not necessarily the ones with the most persistent performance.

Paper Structure (11 sections, 5 equations, 4 figures, 1 table)

Introduction
Related Work
Temporal Replicability
Experimental Evaluation
LongEval Test Collection
Experimental Systems
Results
Discussion and Limitations
Conclusion
Acknowledgments
Disclosure of Interests

Figures (4)

Figure 1: The evolution of the LongEval test collection documents across the three sub-collections. Over time, documents are added, removed, or updated. All documents were harmonized by their URLs.
Figure 2: The P@10, bpref, and nDCG results based on the core queries.
Figure 3: The […] plotted against the […] for the replication […] to […] (left) and […] to […] (right). The […] for E5 is excluded as an outlier.
Figure 4: RRF $\Delta nDCG$ results per topic for […] to […]. The topics are ordered according to the delta.
