On Precomputation and Caching in Information Retrieval Experiments with Pipeline Architectures

Sean MacAvaney, Craig Macdonald

TL;DR

This work tackles the reproducibility and efficiency challenges of information retrieval experiments that use multi-stage pipelines. It introduces two complementary strategies: automatic prefix precomputation to share work across related pipelines and explicit caching via the pyterrier-caching package, which provides KeyValueCache, ScorerCache, RetrieverCache, and IndexerCache. Demonstration experiments on MS MARCO v1 and v2 show substantial reductions in compute time, validating the approach and its potential to enable end-to-end pipelines with reusable caches. Overall, the work contributes practical techniques that promote GreenIR, improve experimental workflow, and facilitate collaboration by making cached results easily shareable.

Abstract

Modern information retrieval systems often rely on multiple components executed in a pipeline. In a research setting, this can lead to substantial redundant computations (e.g., retrieving the same query multiple times for evaluating different downstream rerankers). To overcome this, researchers take cached "result" files as inputs, which represent the output of another pipeline. However, these result files can be brittle and can cause a disconnect between the conceptual design of the pipeline and its logical implementation. To overcome both the redundancy problem (when executing complete pipelines) and the disconnect problem (when relying on intermediate result files), we describe our recent efforts to improve the caching capabilities in the open-source PyTerrier IR platform. We focus on two main directions: (1) automatic implicit caching of common pipeline prefixes when comparing systems and (2) explicit caching of operations through a new extension package, pyterrier-caching. These approaches allow for the best of both worlds: pipelines can be fully expressed end-to-end, while also avoiding redundant computations between pipelines.
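To make direction (1) concrete, the following is a minimal sketch of prefix precomputation in plain Python. It is not PyTerrier's implementation: pipelines are modelled here as plain lists of stage functions, and the names `common_prefix_length`, `run_stages`, and `run_with_prefix_precomputation` are hypothetical helpers introduced only for illustration. The sketch assumes stages are compared by identity and never mutate their input.

```python
from typing import Callable, List, Sequence

# Assumption: a stage maps a list of rows to a new list of rows.
Stage = Callable[[list], list]


def common_prefix_length(pipelines: Sequence[Sequence[Stage]]) -> int:
    """Count the leading stages that are shared (by identity) by all pipelines."""
    if not pipelines:
        return 0
    n = 0
    for stages in zip(*pipelines):
        if all(s is stages[0] for s in stages):
            n += 1
        else:
            break
    return n


def run_stages(stages: Sequence[Stage], rows: list) -> list:
    """Apply the stages to the rows, one after another."""
    for stage in stages:
        rows = stage(rows)
    return rows


def run_with_prefix_precomputation(
    pipelines: Sequence[Sequence[Stage]], queries: list
) -> List[list]:
    """Run the shared prefix once, then only the remainder of each pipeline."""
    k = common_prefix_length(pipelines)
    shared = run_stages(pipelines[0][:k], queries) if pipelines else queries
    return [run_stages(p[k:], shared) for p in pipelines]


# Toy stages that count how often the shared first stage is executed.
calls = {"A": 0}


def stage_a(rows):
    calls["A"] += 1
    return [r + ["A"] for r in rows]


def stage_b(rows):
    return [r + ["B"] for r in rows]


def stage_c(rows):
    return [r + ["C"] for r in rows]


results = run_with_prefix_precomputation(
    [[stage_a, stage_b], [stage_a, stage_c]], [["q1"]]
)
```

In this toy run, the two pipelines correspond to A >> B and A >> C: `stage_a` is executed once and its output feeds both remainders, which is the saving the implicit approach targets when systems are compared side by side.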

Paper Structure

This paper contains 14 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A visual depiction of the prefix precomputation approach for two pipelines: A >> B and A >> C. With prefix precomputation, the common prefix A is identified and its results are reused to compute the remainder of both pipelines, i.e., B and C.
  • Figure 2: The KeyValueCache maps each key (consisting of one or more input columns) to a value (one or more output columns). It assumes rows are treated independently and that the values depend only on the keys.
  • Figure 3: The RetrieverCache maps each key (consisting of one or more input columns) to a value (many rows over one or more output columns). It assumes input rows are treated independently and that the values depend only on the keys.
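The key-to-value mapping described in the Figure 2 caption can be sketched with an in-memory dictionary. This is an illustrative stand-in, not the pyterrier-caching API: the class `DictKeyValueCache`, its constructor arguments, and the toy `text_length` transformer are all hypothetical, and rows are modelled as plain dicts. It relies on the same two assumptions the caption states: rows are independent, and the output columns depend only on the key columns.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]


class DictKeyValueCache:
    """Hypothetical dict-backed cache around a row-wise transformer."""

    def __init__(
        self,
        transform: Callable[[List[Row]], List[Row]],
        key_cols: Sequence[str],
        value_cols: Sequence[str],
    ) -> None:
        self.transform = transform
        self.key_cols = list(key_cols)
        self.value_cols = list(value_cols)
        self._store: Dict[Tuple, Tuple] = {}
        self.misses = 0  # number of distinct keys that had to be computed

    def __call__(self, rows: List[Row]) -> List[Row]:
        keys = [tuple(row[c] for c in self.key_cols) for row in rows]
        # Collect one representative row per key that is not cached yet.
        missing: Dict[Tuple, Row] = {}
        for key, row in zip(keys, rows):
            if key not in self._store and key not in missing:
                missing[key] = row
        if missing:
            # Only the cache misses are sent to the wrapped transformer.
            computed = self.transform(list(missing.values()))
            for key, out in zip(missing, computed):
                self._store[key] = tuple(out[c] for c in self.value_cols)
            self.misses += len(missing)
        # Attach the cached value columns to every input row.
        return [
            {**row, **dict(zip(self.value_cols, self._store[key]))}
            for key, row in zip(keys, rows)
        ]


def text_length(rows: List[Row]) -> List[Row]:
    """Toy transformer: the output column depends only on the 'text' column."""
    return [{**r, "length": len(r["text"])} for r in rows]


cache = DictKeyValueCache(text_length, key_cols=["text"], value_cols=["length"])
first = cache([{"docno": "d1", "text": "hello"}, {"docno": "d2", "text": "hi"}])
misses_after_first = cache.misses
second = cache([{"docno": "d3", "text": "hello"}])
misses_after_second = cache.misses
```

The second call reuses the value stored for the key `("hello",)` even though the row carries a different `docno`, because only the key columns determine the value. A retriever-style cache as in Figure 3 would differ mainly in storing many output rows per key rather than a single tuple of values.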