SuiteEval: Simplifying Retrieval Benchmarks

Andrew Parry, Debasis Ganguly, Sean MacAvaney

TL;DR

This work introduces SuiteEval, a unified framework that offers automatic end-to-end evaluation, dynamic indexing that reuses on-disk indices to minimise disk usage, and built-in support for major benchmarks (BEIR, LoTTE, MS MARCO, NanoBEIR, and BRIGHT).

Abstract

Information retrieval evaluation often suffers from fragmented practices -- varying dataset subsets, aggregation methods, and pipeline configurations -- that undermine reproducibility and comparability, especially for foundation embedding models requiring robust out-of-domain performance. We introduce SuiteEval, a unified framework that offers automatic end-to-end evaluation, dynamic indexing that reuses on-disk indices to minimise disk usage, and built-in support for major benchmarks (BEIR, LoTTE, MS MARCO, NanoBEIR, and BRIGHT). Users need only supply a pipeline generator; SuiteEval handles data loading, indexing, ranking, metric computation, and result aggregation. New benchmark suites can be added in a single line. By reducing boilerplate and standardising evaluation, SuiteEval facilitates reproducible IR research at a time when evaluation over a broader set of benchmarks is increasingly required.
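
The division of labour described in the abstract (the user supplies a pipeline generator, the framework does everything else) can be illustrated with a toy harness. This is a minimal sketch of the general pattern only and is not SuiteEval's API: the names `DATASETS`, `pipelines`, `precision_at_1`, and `evaluate` are hypothetical, the datasets are invented two- and three-document corpora, and precision@1 stands in for a real metric.

```python
from statistics import mean
from typing import Callable, Dict, Iterator, List, Set, Tuple

# A "pipeline" in this sketch is a function from a query string to ranked doc ids.
Pipeline = Callable[[str], List[str]]

# Hypothetical toy datasets (not real benchmarks): corpus, queries, relevant docs.
DATASETS = {
    "toy-a": {
        "corpus": {"d1": "cats purr", "d2": "dogs bark", "d3": "cats and dogs"},
        "queries": {"q1": "cats"},
        "qrels": {"q1": {"d1", "d3"}},
    },
    "toy-b": {
        "corpus": {"d1": "red apple", "d2": "green pear"},
        "queries": {"q1": "pear"},
        "qrels": {"q1": {"d2"}},
    },
}


def pipelines(corpus: Dict[str, str]) -> Iterator[Tuple[str, Pipeline]]:
    """User-supplied generator: yields (name, pipeline) pairs built for one corpus."""

    def term_overlap(query: str) -> List[str]:
        terms = set(query.split())
        scored = [(len(terms & set(text.split())), doc_id)
                  for doc_id, text in corpus.items()]
        # Highest overlap first; ties broken by doc id so output is deterministic.
        return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

    def alphabetical(query: str) -> List[str]:
        # Query-independent baseline: documents in id order.
        return sorted(corpus)

    yield "term-overlap", term_overlap
    yield "alphabetical", alphabetical


def precision_at_1(ranking: List[str], relevant: Set[str]) -> float:
    return 1.0 if ranking and ranking[0] in relevant else 0.0


def evaluate(generator, datasets) -> Dict[str, Dict[str, float]]:
    """Harness side: load each dataset, build pipelines, rank, score, collect."""
    results: Dict[str, Dict[str, float]] = {}
    for ds_name, ds in datasets.items():
        for sys_name, pipeline in generator(ds["corpus"]):
            scores = [precision_at_1(pipeline(query), ds["qrels"][qid])
                      for qid, query in ds["queries"].items()]
            results.setdefault(sys_name, {})[ds_name] = mean(scores)
    return results


results = evaluate(pipelines, DATASETS)
for system, per_dataset in results.items():
    print(system, per_dataset)
```

Because the generator receives the corpus and returns ready-to-run pipelines, the harness never needs to know how a system indexes or ranks; that separation is what lets one generator be reused across every dataset in a suite.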

Paper Structure

This paper contains 4 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Creation of a PisaIndex using BM25 before re-ranking with two neural models.
  • Figure 2: Execution of the BEIR evaluation suite. MonoT5 is taken to be the baseline for significance tests and run files are saved to "beir_results".
  • Figure 3: Definition of a custom suite comprising two corpora (MSMARCOv1 and MSMARCOv2) and three test collections (DL-2019, DL-2020, and DL-2022). For each pipeline defined in the systems function, this suite returns the nDCG@10 value on each test collection and the geometric mean of the three.
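
The aggregation named in the Figure 3 caption, per-collection nDCG@10 followed by a geometric mean, can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the paper's implementation: it uses the linear-gain form of DCG (some tools use an exponential gain instead), it returns 0 for the geometric mean when any score is 0, and the function names and example rankings are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    # Linear-gain DCG: gain at 1-based rank r is discounted by log2(r + 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    # Gains follow the system ranking; unjudged documents count as non-relevant.
    gains = [qrels.get(doc_id, 0) for doc_id in ranking]
    # The ideal ranking places the highest relevance grades first.
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def geometric_mean(values: List[float]) -> float:
    # Assumption of this sketch: any zero score makes the aggregate zero.
    if not values or any(v <= 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical single-query collections, for illustration only.
perfect = ndcg_at_k(["d1", "d2"], {"d1": 1, "d2": 1})
second_place = ndcg_at_k(["x", "d1"], {"d1": 1})
aggregate = geometric_mean([perfect, second_place])
print(perfect, second_place, aggregate)
```

Compared with an arithmetic mean, the geometric mean penalises a pipeline that does well on some collections but poorly on others, which is one reason it is sometimes preferred for cross-collection summaries.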