DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation

Zhu Sun, Hui Fang, Jie Yang, Xinghua Qu, Hongyang Liu, Di Yu, Yew-Soon Ong, Jie Zhang

TL;DR

DaisyRec 2.0 targets rigorous, reproducible evaluation for recommender systems by combining a theory-driven hyper-factor taxonomy with a practical benchmarking toolkit. The authors classify evaluation-affecting factors into model-independent and model-dependent categories, identify evaluation modes, and validate these through a comprehensive empirical study using the DaisyRec 2.0 framework across six public datasets and six metrics. They release DaisyRec 2.0 to standardize data preprocessing, splitting, baselines, losses, sampling, initialization, optimization, and tuning, and provide benchmark results for ten state-of-the-art methods. Key findings include that dataset density does not always improve performance, simple baselines can outshine complex models under certain conditions, and optimal hyper-parameters for one metric may not generalize to others. The work establishes standardized procedures and a reference benchmark to enhance reproducibility and fair comparison, with future work extending to more tasks and evaluation aspects such as diversity and serendipity.

Abstract

Recently, one critical issue has loomed large in the field of recommender systems: there are no effective benchmarks for rigorous evaluation, which consequently leads to unreproducible evaluation and unfair comparison. We therefore conduct studies from the perspectives of practical theory and experiments, aiming at benchmarking recommendation for rigorous evaluation. Regarding the theoretical study, a series of hyper-factors affecting recommendation performance throughout the whole evaluation chain are systematically summarized and analyzed via an exhaustive review of 141 papers published at eight top-tier conferences within 2017-2020. We then classify them into model-independent and model-dependent hyper-factors, and define and discuss in depth the corresponding modes of rigorous evaluation. For the experimental study, we release the DaisyRec 2.0 library, which integrates these hyper-factors to perform rigorous evaluation, and use it to conduct a holistic empirical study that unveils the impacts of different hyper-factors on recommendation performance. Supported by the theoretical and experimental studies, we finally create benchmarks for rigorous evaluation by proposing standardized procedures and providing the performance of ten state-of-the-art methods across six evaluation metrics on six datasets as a reference for later study. Overall, our work sheds light on the issues in recommendation evaluation, provides potential solutions for rigorous evaluation, and lays the foundation for further investigation.

Paper Structure (36 sections, 4 equations, 13 figures, 19 tables)

Figures (13)

  • Figure 1: Hyper-factors within the whole recommendation evaluation chain.
  • Figure 2: (a) popularity of the top-15 datasets, where 'ML' and 'AMZ' denote MovieLens and Amazon, respectively; (b) popularity of the top-15 baselines; and (c) popularity of the top-10 evaluation metrics. The datasets, baselines, and metrics selected for our study are highlighted in blue.
  • Figure 3: The overall structure of DaisyRec 2.0, composed of four components, i.e., GUI Command Generator, Loader, Recommender, and Evaluator.
  • Figure 4: An example of the generated tune command for Multi-VAE.
  • Figure 5: Performance of baselines w.r.t. time-aware split-by-ratio on the six datasets across origin, 5- and 10-filter settings.
  • ...and 8 more figures