Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels

Negar Arabzadeh; Charles L. A. Clarke

TL;DR

This work proposes leveraging the Fréchet Distance to measure the distance between the distributions of relevant judged items and retrieved results, taking inspiration from the success of using Fréchet Inception Distance (FID) in assessing text-to-image generation systems.

Abstract

The rapid advancement of natural language processing, information retrieval (IR), computer vision, and other technologies has presented significant challenges in evaluating the performance of these systems. One of the main challenges is the scarcity of human-labeled data, which hinders the fair and accurate assessment of these systems. In this work, we specifically focus on evaluating IR systems with sparse labels, borrowing from recent research on evaluating computer vision tasks. Taking inspiration from the success of using Fréchet Inception Distance (FID) in assessing text-to-image generation systems, we propose leveraging the Fréchet Distance to measure the distance between the distributions of relevant judged items and retrieved results. Our experimental results on the MS MARCO V1 dataset and the TREC Deep Learning Tracks query sets demonstrate the effectiveness of the Fréchet Distance as a metric for evaluating IR systems, particularly in settings where only a few labels are available. This approach contributes to the advancement of evaluation methodologies in real-world scenarios such as the assessment of generative IR systems.
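
To make the idea in the abstract concrete, the following is a minimal sketch of a Fréchet Distance between two sets of embeddings, written in the closed form that FID uses for Gaussians fitted to each set. It is an illustration under stated assumptions, not the authors' implementation: the function name `frechet_distance`, the pooling of embeddings into two plain arrays, and the synthetic data are all hypothetical, and the choice of embedding model is left out entirely. As with FID, the value returned is the squared form of the distance.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance (squared form, as in FID) between Gaussians
    fitted to two embedding sets.

    x, y: arrays of shape (n_items, dim), one embedding per row.
    Assumption: each set holds enough items to estimate a covariance.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each set of embeddings.
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))

    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y; tiny negative values caused by
    # round-off are clipped to zero before taking the root.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


# Hypothetical stand-ins for embeddings of judged relevant items and of
# retrieved results; real usage would embed actual passages.
rng = np.random.default_rng(0)
relevant = rng.normal(size=(200, 8))
retrieved_close = relevant + rng.normal(scale=0.1, size=relevant.shape)
retrieved_far = rng.normal(loc=2.0, size=(200, 8))

fd_close = frechet_distance(relevant, retrieved_close)
fd_far = frechet_distance(relevant, retrieved_far)
```

In this toy setup a lower value indicates that the retrieved set is distributed more like the relevant set, so `fd_close` comes out smaller than `fd_far`. Computing the trace term from eigenvalues avoids an explicit matrix square root; a library routine for the matrix square root would be an equally reasonable choice.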

Paper Structure (17 sections, 4 equations, 2 figures, 5 tables)

Introduction
Fréchet Distance for IR evaluation
Fréchet Distance
Fréchet Inception Distance
Fréchet Distance for IR
Experimental Setup
Dataset and Query sets
Retrieval models
Embeddings
Assessment with Sparse labels
Assessing with Comprehensive labels
Assessing Unlabeled Retrieved Results
Further analysis
Correlation with IR Evaluation Metrics
Impact of Document Representation
...and 2 more sections

Figures (2)

Figure 1: Performance of the 12 different retrieval methods in terms of MRR@10 and $\textit{FD}@10$ under bootstrap sampling (N=1000) of queries in the MS MARCO dev set.
Figure 2: Performance of all the runs submitted to TREC DL 2019 (first row) and TREC DL 2020 (second row). In each sub-figure, the X-axis and Y-axis indicate nDCG@10 and $\textit{FD}@10$, respectively. $\textit{FD}@10$ was measured with 1, 5, and 10 relevant items per query in the sub-figures in the first, second, and third columns, respectively.
