Performance Evaluation in Multimedia Retrieval
Loris Sauter, Ralph Gasser, Heiko Schuldt, Abraham Bernstein, Luca Rossetto
TL;DR
This work addresses the reproducibility and comparability challenges of multimedia retrieval evaluation by introducing a formal model with three phases (definition, execution, analysis) and a data-task-user pillar framework, enabling consistent treatment of both interactive and non-interactive settings. It formalizes task descriptions, relevance judgments, and evaluation templates, and defines task- and evaluation-level metrics to support rigorous analysis. The authors present the Distributed Retrieval Evaluation Server (DRES), an open-source, modular infrastructure (backend in Kotlin, frontend in TypeScript) that supports synchronous, asynchronous, distributed, and non-interactive evaluations through REST APIs and a flexible data model. By mapping the model to real-world campaigns such as the Video Browser Showdown (VBS) and the Lifelog Search Challenge (LSC) and detailing practical usage scenarios, the paper demonstrates how standardized evaluation can lower the hurdles for conducting retrieval experiments, improve their reproducibility, and scale to large multimedia retrieval experiments.
Abstract
Performance evaluation in multimedia retrieval, as in the information retrieval domain at large, relies heavily on retrieval experiments, employing a broad range of techniques and metrics. These can involve human-in-the-loop and machine-only settings for the retrieval process itself and the subsequent verification of results. Such experiments can be elaborate and use-case-specific, which can make them difficult to compare or replicate. In this paper, we present a formal model to express all relevant aspects of such retrieval experiments, as well as a flexible open-source evaluation infrastructure that implements the model. These contributions are intended as a step towards lowering the hurdles for conducting retrieval experiments and improving their reproducibility.
