DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems

Alberto Carlo Maria Mancino, Salvatore Bufi, Angela Di Fazio, Antonio Ferrara, Daniele Malitesta, Claudio Pomo, Tommaso Di Noia

TL;DR

The paper addresses reproducibility challenges in recommender system research that arise from fragmented data management. It proposes DataRec, a modular, open-source Python library that standardizes dataset handling, versioning, preprocessing, and splitting, and that is explicitly designed to interoperate with multiple frameworks. Through a survey of 55 papers (2020–2024), the authors identify prevalent dataset usage, preprocessing, and splitting practices and build DataRec to reflect them; the library ships with 16 built-in datasets and keeps data pipelines traceable for reproducibility. The work aims to foster fair benchmarking and broader trust in experimental results by providing traceable, exportable data pipelines that integrate with existing frameworks, and it outlines future expansion to incorporate more datasets and side-information management.

Abstract

Recommender systems have demonstrated significant impact across diverse domains, yet ensuring the reproducibility of experimental findings remains a persistent challenge. A primary obstacle lies in the fragmented and often opaque data management strategies employed during the preprocessing stage, where decisions about dataset selection, filtering, and splitting can substantially influence outcomes. To address these limitations, we introduce DataRec, an open-source Python library specifically designed to unify and streamline data handling in recommender system research. By providing reproducible routines for dataset preparation, data versioning, and seamless integration with other frameworks, DataRec promotes methodological standardization, interoperability, and comparability across different experimental setups. Our design is informed by an in-depth review of 55 state-of-the-art recommendation studies, ensuring that DataRec adopts best practices while addressing common pitfalls in data management. Ultimately, our contribution facilitates fair benchmarking, enhances reproducibility, and fosters greater trust in experimental results within the broader recommender systems community. The DataRec library, documentation, and examples are freely available at https://github.com/sisinflab/DataRec.

Paper Structure

This paper contains 17 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Overview of the DataRec architecture. The DataRec class provides key dataset metrics and interacts with three modules: I/O for handling different data formats, processing for dataset transformations, and splitting for partitioning into training, validation, and test sets.
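
To make the processing and splitting stages named in the caption more concrete, the following is a minimal sketch in plain Python of two steps such a pipeline might contain: a k-core filter and a seeded random hold-out split. It does not use or reproduce the DataRec API. The names `k_core_filter` and `holdout_split`, the `(user, item)` tuple representation, and the choice of these two strategies are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical representation: one (user_id, item_id) pair per interaction.
Interaction = tuple[str, str]


def k_core_filter(interactions: list[Interaction], k: int) -> list[Interaction]:
    """Repeatedly drop users and items that have fewer than k interactions.

    Removing a user can push an item below k (and vice versa), so the
    filter is re-applied until no further interaction is removed.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


def holdout_split(
    interactions: list[Interaction], test_ratio: float, seed: int
) -> tuple[list[Interaction], list[Interaction]]:
    """Random hold-out split; the explicit seed makes the split repeatable."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    data = [
        ("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
        ("u2", "i1"), ("u2", "i2"), ("u2", "i3"),
        ("u3", "i1"), ("u3", "i2"),
        ("u4", "i4"),
    ]
    core = k_core_filter(data, k=2)
    train, test = holdout_split(core, test_ratio=0.25, seed=42)
    print(len(core), len(train), len(test))
```

Passing the seed and the threshold `k` as explicit arguments is the point of the sketch: when every preprocessing decision is a recorded parameter, the same train and test sets can be regenerated later, which is the kind of traceability the paper argues for.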