Trove: A Flexible Toolkit for Dense Retrieval

Reza Esfandiarpoor, Max Zuo, Stephen H. Bach

TL;DR

Trove reduces the engineering burden of retrieval experiments in two ways: it loads and processes data on the fly using a memory-efficient representation, and it unifies evaluation and hard negative mining in a single pipeline that also runs across multiple nodes. Its modular design exposes the main pipeline components (data, modeling, training, and inference) for customization while remaining compatible with the Hugging Face Transformers ecosystem. Key contributions include MaterializedQRel with memory-mapped Apache Arrow storage, a configuration-driven workflow, and fast, scalable inference aided by a specialized top-k tracker. Together, these reduce memory usage, speed up experimentation, and support flexible, scalable retrieval research with minimal code changes.
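
To make the idea of a top-k tracker concrete, the following is a minimal sketch of one common way to keep only the best k documents per query while scores arrive in batches. It is not Trove's implementation: the class name `TopKTracker` and its methods are hypothetical, and a bounded min-heap is simply assumed as the underlying technique.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


class TopKTracker:
    """Hypothetical streaming top-k tracker (illustrative, not Trove's class)."""

    def __init__(self, k: int) -> None:
        self.k = k
        # query id -> min-heap of (score, doc_id); the root is the weakest kept hit.
        self._heaps: Dict[str, List[Tuple[float, str]]] = {}

    def update(self, query_id: str, scored_docs: Iterable[Tuple[str, float]]) -> None:
        """Merge one batch of (doc_id, score) pairs into the running top-k."""
        heap = self._heaps.setdefault(query_id, [])
        for doc_id, score in scored_docs:
            if len(heap) < self.k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                # New score beats the weakest kept hit, so swap it in.
                heapq.heapreplace(heap, (score, doc_id))

    def results(self, query_id: str) -> List[Tuple[str, float]]:
        """Return the kept hits for a query, best score first."""
        ranked = sorted(self._heaps.get(query_id, []), reverse=True)
        return [(doc_id, score) for score, doc_id in ranked]


tracker = TopKTracker(k=2)
tracker.update("q1", [("d1", 0.2), ("d2", 0.9)])  # first corpus shard
tracker.update("q1", [("d3", 0.5), ("d4", 0.1)])  # second corpus shard
top = tracker.results("q1")
```

Because each query holds at most k entries, memory stays bounded no matter how many corpus shards are scored, which is why a structure of this kind suits large-scale inference.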

Abstract

We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
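
As a rough illustration of what processing "on the fly" means here, the sketch below filters, transforms, and combines relevance records lazily instead of writing a new pre-processed file for each configuration. It assumes plain Python generators and does not use Trove's API: `QRel`, `lazy_view`, and the sample records are hypothetical names and data introduced only for this example.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, NamedTuple


class QRel(NamedTuple):
    """Hypothetical relevance record: one (query, document, score) triple."""
    query_id: str
    doc_id: str
    score: float


def lazy_view(
    source: Iterable[QRel],
    keep: Callable[[QRel], bool] = lambda r: True,
    transform: Callable[[QRel], QRel] = lambda r: r,
) -> Iterator[QRel]:
    """Yield filtered and transformed records without materializing a copy."""
    for record in source:
        if keep(record):
            yield transform(record)


# Two illustrative sources: human annotations and mined hard negatives.
annotated = [QRel("q1", "d1", 3.0), QRel("q1", "d2", 0.0)]
mined = [QRel("q1", "d7", 0.8)]

# Keep only annotated positives and relabel them as 1.0.
positives = lazy_view(
    annotated,
    keep=lambda r: r.score > 0,
    transform=lambda r: r._replace(score=1.0),
)
# Relabel every mined document as a negative (0.0).
negatives = lazy_view(mined, transform=lambda r: r._replace(score=0.0))

# Combine both views; nothing is computed until the chain is consumed.
training_records = list(chain(positives, negatives))
```

Changing the filter or the label mapping only changes the configuration of the view, so trying a different dataset setup does not require storing another copy of the data.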

Paper Structure

This paper contains 21 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: A) Existing toolkits require manually creating and maintaining large pre-processed data files for each experiment. B) Trove processes datasets on the fly based on the given configuration options.
  • Figure 2: Training and evaluation workflow with Trove.
  • Figure 3: Training with Mined Hard Negatives