Reproducibility and Artifact Consistency of the SIGIR 2022 Recommender Systems Papers Based on Message Passing

Maurizio Ferrari Dacrema, Michael Benigni, Nicola Ferro

TL;DR

The paper scrutinizes reproducibility and artifact consistency for ten graph-based recommender systems papers, most of them published at SIGIR 2022, revealing widespread issues such as erroneous data splits, information leakage, and misalignment between artifacts and paper descriptions. It shows that many claimed improvements fail to beat simple baselines, especially on Amazon-Book, and that artifact quality varies, with only a minority achieving functional reproducibility. The authors attempted re-executions, hyperparameter re-tuning, and independent baselines, highlighting how optimization and data handling choices shape outcomes and often undermine cross-study comparability. They also analyze how these reproducibility problems influence follow-up SIGIR 2023 work, underscoring the need for transparent reporting, standardized datasets, and stronger baselines to advance the field reliably.

Abstract

Graph-based techniques relying on neural networks and embeddings have gained attention as a way to develop Recommender Systems (RS), with several papers on the topic presented at SIGIR 2022 and 2023. Given the importance of ensuring that published research is methodologically sound and reproducible, in this paper we analyze 10 graph-based RS papers, most of which were published at SIGIR 2022, and assess their impact on subsequent work published in SIGIR 2023. Our analysis reveals several critical points that require attention: (i) the prevalence of bad practices, such as erroneous data splits or information leakage between training and testing data, which call into question the validity of the results; (ii) frequent inconsistencies between the provided artifacts (source code and data) and their descriptions in the paper, causing uncertainty about what is actually being evaluated; and (iii) the preference for new or complex baselines that are weaker than simpler ones, creating the impression of continuous improvement even when, particularly for the Amazon-Book dataset, the state-of-the-art has significantly worsened. Due to these issues, we are unable to confirm the claims made in most of the papers we examined and attempted to reproduce.

Paper Structure

This paper contains 77 sections, 2 figures, 16 tables.

Figures (2)

  • Figure 1: Normalized popularity distributions of the training and test data splits for Yelp2018 used in the LightGCN paper; the value 1 corresponds to the most popular item in that split. One panel shows the expected popularity distribution for a random holdout data split, in which the normalized values for training and validation are similar on average. The other panel shows the distribution in the original data splits, where the training and test distributions visibly differ.
  • Figure 2: Normalized popularity distributions of the original training and test data splits for Yelp2018 used by the HAKG paper.
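
The normalized popularity shown in both figures can be illustrated with a minimal sketch. This is not the authors' code: it assumes each split is available as a list of (user, item) pairs, and the function name `normalized_popularity` and the toy data are hypothetical. Each item's interaction count is divided by the count of the most popular item in the same split, so the top item maps to 1.

```python
from collections import Counter


def normalized_popularity(interactions):
    """Return per-item popularity, sorted descending and scaled so that
    the most popular item in this split has value 1.0.

    `interactions` is assumed to be an iterable of (user, item) pairs.
    """
    counts = Counter(item for _user, item in interactions)
    if not counts:
        return []
    top = max(counts.values())
    return [count / top for count in sorted(counts.values(), reverse=True)]


# Toy splits (hypothetical data, for illustration only).
train = [(0, "a"), (1, "a"), (2, "a"), (3, "a"), (0, "b"), (1, "b"), (2, "c")]
test = [(4, "a"), (5, "a"), (3, "b"), (4, "c")]

train_curve = normalized_popularity(train)
test_curve = normalized_popularity(test)

print(train_curve)  # [1.0, 0.5, 0.25]
print(test_curve)   # [1.0, 0.5, 0.5]
```

For a random holdout split, the two curves would be expected to have a similar shape; a marked difference between them, as reported for the original Yelp2018 splits, suggests the split was not produced by uniform random sampling.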