
A Reproducible Analysis of Sequential Recommender Systems

Filippo Betello, Antonio Purificato, Federico Siciliano, Giovanni Trappolini, Andrea Bacciu, Nicola Tonellotto, Fabrizio Silvestri

TL;DR

The paper tackles reproducibility gaps in Sequential Recommender Systems (SRSs) by introducing EasyRec, a standardized framework for data preprocessing and model implementation that enables fair, repeatable evaluations across datasets. It conducts extensive, controlled experiments on multiple benchmarks, re-evaluating classic SRS models (e.g., GRU4Rec, SASRec, BERT4Rec, NARM, CORE) under consistent settings and tracking energy emissions. Key findings include that GRU4Rec can outperform others on MovieLens datasets, while transformer-based models like SASRec excel with larger embedding sizes; longer input sequences generally help attention-based models, though effects are dataset-dependent. The work emphasizes the importance of standardized benchmarks and sustainability considerations, providing an open-source path to robust, comparable SRS research and benchmarking.

Abstract

Sequential Recommender Systems (SRSs) have emerged as a highly efficient approach to recommendation systems. By leveraging sequential data, SRSs can identify temporal patterns in user behaviour, significantly improving recommendation accuracy and relevance. Ensuring the reproducibility of these models is paramount for advancing research and facilitating comparisons between them. Existing works exhibit shortcomings in reproducibility and replicability of results, leading to inconsistent statements across papers. Our work fills these gaps by standardising data pre-processing and model implementations, providing a comprehensive code resource, including a framework for developing SRSs and establishing a foundation for consistent and reproducible experimentation. We conduct extensive experiments on several benchmark datasets, comparing various SRSs implemented in our resource. We challenge prevailing performance benchmarks, offering new insights into the SR domain. For instance, SASRec does not consistently outperform GRU4Rec. On the contrary, when the number of model parameters becomes substantial, SASRec starts to clearly dominate all the other SRSs. This discrepancy underscores the significant impact that experimental configuration has on the outcomes and the importance of setting it up to ensure precise and comprehensive results. Failure to do so can lead to significantly flawed conclusions, highlighting the need for rigorous experimental design and analysis in SRS research. Our code is available at https://github.com/antoniopurificato/recsys_repro_conf.

Paper Structure (22 sections, 5 figures, 3 tables)


Figures (5)

  • Figure 1: A visual depiction illustrating the core functionality of the considered SRSs. The input items $\{i_1, i_2, \dots, i_L\}$ are processed to generate the final representations $\{z_1, z_2, \dots, z_L\}$, which are utilized to generate predictions at steps $1, 2, \dots, L$, respectively. Intermediate representations $\{h_1, h_2, \dots, h_L\}$ are also present for some models.
  • Figure 2: Effect of input sequence length on model performance, as measured by NDCG@10. Each plot shows the results of the five models on one dataset.
  • Figure 3: Effect of embedding size on model performance, as measured by NDCG@10. Each plot shows the results of the five models on one dataset.
  • Figure 4: Effect of the total number of model parameters on performance, as measured by NDCG@10. Each plot shows the results of the five models on one dataset.
  • Figure 5: Relation between emissions, measured as CO2-eq in kg, and performance, measured by NDCG@10. Each plot shows the results of the five models on one dataset.
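
Figures 2 to 5 all report NDCG@10, so a short sketch of how that metric is commonly computed may help when reading them. The example below assumes a next-item setting with exactly one held-out target item per user, in which case the ideal DCG is 1 and NDCG@k reduces to 1/log2(rank + 1). The function names `ndcg_at_k` and `mean_ndcg_at_k` are hypothetical and are not taken from the paper's code; the authors' exact evaluation protocol may differ.

```python
import math
from typing import Sequence


def ndcg_at_k(ranked_items: Sequence[int], target_item: int, k: int = 10) -> float:
    """NDCG@k for one user, assuming a single relevant (held-out) item.

    With one relevant item the ideal DCG equals 1, so the score is
    1 / log2(rank + 1) when the target is in the top k, and 0 otherwise.
    """
    top_k = list(ranked_items[:k])
    if target_item not in top_k:
        return 0.0
    rank = top_k.index(target_item) + 1  # 1-based position in the ranking
    return 1.0 / math.log2(rank + 1)


def mean_ndcg_at_k(
    rankings: Sequence[Sequence[int]], targets: Sequence[int], k: int = 10
) -> float:
    """Average NDCG@k over users; returns 0.0 for an empty input."""
    scores = [ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two illustrative users (made-up item IDs): the first target is ranked
    # first, the second target does not appear in the ranking at all.
    print(mean_ndcg_at_k([[5, 3, 9], [1, 2, 4]], [5, 7], k=10))
```

A target ranked first scores 1, a target ranked second scores 1/log2(3), and a target outside the top 10 scores 0, so the metric rewards placing the correct next item near the top of the list.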