Performance Evaluation in Multimedia Retrieval

Loris Sauter, Ralph Gasser, Heiko Schuldt, Abraham Bernstein, Luca Rossetto

TL;DR

This work addresses the reproducibility and comparability challenges of multimedia retrieval evaluation by introducing a formal model with three phases (definition, execution, analysis) and a data-task-user pillar framework, which allows interactive and non-interactive settings to be treated consistently. It formalizes task descriptions, relevance judgments, and evaluation templates, and defines task- and evaluation-level metrics to support rigorous analysis. The authors present the Distributed Retrieval Evaluation Server (DRES), an open-source, modular infrastructure (backend in Kotlin, frontend in TypeScript) that supports synchronous, asynchronous, distributed, and non-interactive evaluations through REST APIs and a flexible data model. By mapping the model to real-world campaigns such as VBS and LSC and detailing practical usage scenarios, the paper demonstrates how standardized evaluation can lower the hurdles for conducting large-scale multimedia retrieval experiments and improve their reproducibility.

Abstract

Performance evaluation in multimedia retrieval, as in the information retrieval domain at large, relies heavily on retrieval experiments, employing a broad range of techniques and metrics. These can involve human-in-the-loop and machine-only settings for the retrieval process itself and the subsequent verification of results. Such experiments can be elaborate and use-case-specific, which can make them difficult to compare or replicate. In this paper, we present a formal model to express all relevant aspects of such retrieval experiments, as well as a flexible open-source evaluation infrastructure that implements the model. These contributions are intended as a step towards lowering the hurdles for conducting retrieval experiments and improving their reproducibility.

Paper Structure

This paper contains 30 sections, 10 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: State-machine representation of the evaluation execution phase. The preceding (preparation) and following (analysis) phases have been omitted for the sake of brevity.
  • Figure 2: Illustration of an information need description along different media channels. Rectangles indicate the usage of a channel with a given start and end time. The concept of media channels for task descriptions limits the powerset returned by the task description function described in \ref{eq:desc}.
  • Figure 3: Overview of DRES' architecture and system components, which include the backend and frontend as well as a persistence layer.
  • Figure 4: Overview of DRES' data model. Only the most important entities are illustrated. Entities are grouped by border style: the densely dashed group contains entities related to the agents, the dotted group contains entities related to templating (evaluation definition), the loosely dashed group contains entities related to the evaluation execution, the dash-and-dot group contains entities used to model submissions, and the crossed group contains entities representing the test collection.
  • Figure 5: Screenshots of DRES' evaluation editor, which facilitates the management of evaluation templates.
  • ...and 1 more figure