Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis

Vito Walter Anelli; Daniele Malitesta; Claudio Pomo; Alejandro Bellogín; Tommaso Di Noia; Eugenio Di Sciascio

TL;DR

This work tackles reproducibility in graph-based recommender systems by re-implementing and replicating six prominent graph CF models (NGCF, DGCF, LightGCN, SGL, UltraGCN, GFCF) on Gowalla, Yelp 2018, and Amazon Book, and by extending the evaluation to Allrecipes and BookCrossing to probe how dataset characteristics affect performance. Alongside the replication, it compares the graph models against strong classic baselines (e.g., RP$^3\beta$, EASE$^R$) and finds that no graph CF method dominates across all datasets. The study also introduces a topology-aware analysis of the information flowing from multi-hop neighborhoods, showing that 1-hop activeness and 2-hop item popularity strongly influence recommendations and that performance correlates with dataset structure and with quartile-based user groups. Overall, the findings emphasize the need for rigorous baselines and dataset-aware evaluation in graph CF, and the authors release their code publicly to support further reproducibility work and extensions.

Abstract

The success of graph neural network-based models (GNNs) has significantly advanced recommender systems by effectively modeling users and items as a bipartite, undirected graph. However, many graph-based works adopt results from baseline papers without verifying their validity for the specific configuration under analysis. Our work addresses this issue by focusing on the replicability of results. We present code that successfully replicates results from six popular and recent graph recommendation models (NGCF, DGCF, LightGCN, SGL, UltraGCN, and GFCF) on three common benchmark datasets (Gowalla, Yelp 2018, and Amazon Book). Additionally, we compare these graph models with traditional collaborative filtering models that have historically performed well in offline evaluations. Furthermore, we extend our study to two new datasets (Allrecipes and BookCrossing) that lack established setups in the existing literature. As the performance on these datasets differs from that on the previous benchmarks, we analyze the impact of specific dataset characteristics on recommendation accuracy. By investigating the information flow from users' neighborhoods, we aim to identify which models are influenced by intrinsic features of the dataset structure. The code to reproduce our experiments is available at: https://github.com/sisinflab/Graph-RSs-Reproducibility.

Paper Structure (19 sections, 2 equations, 2 figures, 7 tables)

Introduction and related work
Background and reproducibility analysis
Graph collaborative filtering
Analysis on reported baselines
Analysis on reported datasets
Analysis on experimental comparison
Replication of prior results (RQ1)
Settings
Results
Benchmarking graph CF approaches using alternative baselines (RQ2)
Settings
Results
Extending the experimental comparison to new datasets (RQ3 and RQ4)
Settings
Results
...and 4 more sections

Figures (2)

Figure 1: A toy user-item graph where the ego user node (highlighted) receives the information flow from the (a) 1-, (b) 2-, and (c) 3-hop neighbor nodes (highlighted). Arrows' direction is a visual representation of the information flow.
Figure 2: Percentage variation between the nDCG on user quartiles and the average nDCG value across all users (indicated as the dashed line), for each model-dataset setting. Rows refer to user quartiles when considering (a) 1-, (b) 2-, and (c) 3-hop.
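
The quantity plotted in Figure 2 can be illustrated with a minimal sketch. It assumes that users are ranked by a hop-based feature, split into four equal-size groups, and that each group's mean nDCG is compared with the mean over all users. The function name `quartile_ndcg_variation`, the toy interaction matrix, the nDCG values, and the use of user degree as the 1-hop activeness feature are all hypothetical choices made for illustration; they are not taken from the authors' code.

```python
import numpy as np


def quartile_ndcg_variation(hop_feature, ndcg):
    """Percentage variation of each user quartile's mean nDCG
    with respect to the mean nDCG over all users.

    hop_feature: one value per user (assumed here: a hop-based statistic).
    ndcg: per-user nDCG values, aligned with hop_feature.
    """
    hop_feature = np.asarray(hop_feature, dtype=float)
    ndcg = np.asarray(ndcg, dtype=float)
    overall = ndcg.mean()
    # Sort users by the feature (stable, so ties keep their input order),
    # then cut the ranking into four groups of (nearly) equal size.
    order = np.argsort(hop_feature, kind="stable")
    groups = np.array_split(order, 4)
    return [100.0 * (ndcg[g].mean() - overall) / overall for g in groups]


# Toy binary user-item matrix (assumption): user u interacted with items 0..u.
interactions = np.tril(np.ones((8, 8), dtype=int))

# 1-hop activeness, assumed here to be the user's degree in the bipartite graph.
activeness = interactions.sum(axis=1)

# Made-up per-user nDCG values, for illustration only.
ndcg = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4])

variation = quartile_ndcg_variation(activeness, ndcg)
print([round(v, 2) for v in variation])
```

On this toy input the four quartiles deviate from the overall mean nDCG of 0.25 by roughly -60%, -20%, +20%, and +60%. The same function could be applied to 2-hop or 3-hop features once they are defined for the dataset at hand.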
