Does It Look Sequential? An Analysis of Datasets for Evaluation of Sequential Recommendations

Anton Klenitskiy, Anna Volodkevich, Anton Pembek, Alexey Vasilev

TL;DR

The paper analyzes whether datasets used to evaluate sequential recommender systems (SRS) actually encode sequential structure. It introduces three shuffling-based assessment methods, combining a model-agnostic sequential-rule analysis with model-based degradation and stability metrics using SASRec and GRU4Rec across 15 datasets. The findings show that several popular datasets (e.g., Foursquare, Gowalla, RetailRocket, Steam, Yelp) exhibit weak sequential structure, raising concerns about their suitability for SRS evaluation and highlighting the impact of preprocessing and dataset selection. The work provides a practical framework and guidance for selecting datasets and assessing their sequential structure, so that sequential recommender methods are evaluated meaningfully.

Abstract

Sequential recommender systems are an important and in-demand area of research. Such systems aim to use the order of interactions in a user's history to predict future interactions. The premise is that the order of interactions and sequential patterns play an essential role. Therefore, it is crucial to evaluate sequential recommenders on datasets that exhibit a sequential structure. We apply several methods based on random shuffling of each user's sequence of interactions to assess the strength of sequential structure across 15 datasets frequently used for evaluating sequential recommender systems in recent research papers presented at top-tier conferences. Because shuffling explicitly breaks the sequential dependencies inherent in a dataset, we estimate the strength of sequential patterns by comparing metrics for the shuffled and original versions of the dataset. Our findings show that several popular datasets have a rather weak sequential structure.
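
The shuffling idea can be illustrated with a minimal Python sketch of a model-agnostic check. It counts n-grams of consecutive items that recur across users, shuffles each user's sequence, and reports the relative change in that count. The function names, the `min_support` threshold, and the toy data are assumptions made for illustration; the paper's exact rule definitions and thresholds are not given in this summary.

```python
import random
from collections import Counter


def count_frequent_ngrams(sequences, n=2, min_support=2):
    """Count distinct n-grams of consecutive items found in >= min_support sequences."""
    support = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per n-gram.
        grams = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
        support.update(grams)
    return sum(1 for c in support.values() if c >= min_support)


def shuffle_sequences(sequences, seed=0):
    """Return a copy of the data with every user's sequence randomly permuted."""
    rng = random.Random(seed)
    shuffled = []
    for seq in sequences:
        permuted = list(seq)
        rng.shuffle(permuted)
        shuffled.append(permuted)
    return shuffled


def relative_change(original, shuffled):
    """Relative change of a statistic after shuffling (negative means a drop)."""
    return (shuffled - original) / original if original else 0.0


# Hypothetical toy data: every user follows the same item order.
sequences = [["a", "b", "c", "d", "e"] for _ in range(5)]

original_count = count_frequent_ngrams(sequences, n=2)
shuffled_count = count_frequent_ngrams(shuffle_sequences(sequences), n=2)
print("relative change in 2-gram count:",
      relative_change(original_count, shuffled_count))
```

Under these assumptions, a relative change close to zero would suggest that item order carries little information, while a large drop would point to strong sequential patterns.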

Paper Structure (16 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Data splitting strategy.
  • Figure 2: Ranks of the datasets according to different metrics considered. Rank 1 means this dataset has the strongest sequential structure according to the given metric. Datasets are sorted by an average of SASRec and GRU4Rec NDCG@10 ranks.
  • Figure 3: Spearman's correlation between all metrics: relative change in HitRate@10 and NDCG@10 for SASRec and GRU4Rec, Jaccard@10 for these models, and relative change in counts of 2-grams and 3-grams.
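
The Jaccard@10 metric named in Figure 3 can be sketched in a few lines of Python. The sketch assumes that the metric compares, for each user, the top-10 recommendation list produced from the original history with the list produced from the shuffled history; that pairing, the function names, and the dictionary layout are assumptions made for illustration, not details confirmed by this summary.

```python
def jaccard_at_k(recs_a, recs_b, k=10):
    """Jaccard similarity between the top-k items of two ranked lists."""
    top_a, top_b = set(recs_a[:k]), set(recs_b[:k])
    union = top_a | top_b
    # Two empty lists are treated as identical.
    return len(top_a & top_b) / len(union) if union else 1.0


def mean_jaccard_at_k(recs_original, recs_shuffled, k=10):
    """Average Jaccard@k over users present in both {user: ranked items} dicts."""
    users = recs_original.keys() & recs_shuffled.keys()
    if not users:
        return 0.0
    total = sum(jaccard_at_k(recs_original[u], recs_shuffled[u], k) for u in users)
    return total / len(users)
```

Read this way, a mean value near 1 would indicate that shuffling barely changes the recommendations, which is what one would expect on a dataset with weak sequential structure.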