From Variability to Stability: Advancing RecSys Benchmarking Practices
Valeriy Shevchenko, Nikita Belousov, Alexey Vasilev, Vladimir Zholobov, Artyom Sosedka, Natalia Semenova, Anna Volodkevich, Andrey Savchenko, Alexey Zaytsev
TL;DR
This work addresses the problem that RecSys algorithm evaluation is often biased by dataset choice. It proposes a robust benchmarking pipeline built on $30$ open datasets (including two new ones) that compares $11$ collaborative filtering models across $9$ metrics. It systematically analyzes metric aggregation methods (MR, MA, DM-AUC, DM-LBO, Copeland, Minimax) and demonstrates that the choice of aggregation materially influences rankings, while certain methods (e.g., geometric/harmonic means, DM-LBO) are more stable under perturbations. By linking dataset characteristics to performance and clustering datasets to identify principal ones, the authors show that a compact yet representative benchmarking subset can yield rankings similar to those of the full benchmark, enabling fairer and more scalable evaluation. The study identifies EASE as a consistently strong performer across aggregations and contributes two new datasets (MegaMarket, Zvuk) to broaden RecSys evaluation, offering practical guidance for reproducible offline benchmarking in industry and research contexts.
Abstract
In the rapidly evolving domain of Recommender Systems (RecSys), new algorithms frequently claim state-of-the-art performance based on evaluations over a limited set of arbitrarily selected datasets. However, this approach may fail to holistically reflect their effectiveness due to the significant impact of dataset characteristics on algorithm performance. Addressing this deficiency, this paper introduces a novel benchmarking methodology to facilitate a fair and robust comparison of RecSys algorithms, thereby advancing evaluation practices. By utilizing a diverse set of $30$ open datasets, including two introduced in this work, and evaluating $11$ collaborative filtering algorithms across $9$ metrics, we critically examine the influence of dataset characteristics on algorithm performance. We further investigate the feasibility of aggregating outcomes from multiple datasets into a unified ranking. Through rigorous experimental analysis, we validate the reliability of our methodology under the variability of datasets, offering a benchmarking strategy that balances quality and computational demands. This methodology enables a fair yet effective means of evaluating RecSys algorithms, providing valuable guidance for future research endeavors.
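To make the idea of aggregating outcomes from multiple datasets into a unified ranking more concrete, the following is a minimal sketch of two simple aggregation schemes: mean rank and a Copeland-style pairwise majority score. The score matrix, the algorithm names `A`, `B`, `C`, and the helper function names are hypothetical and chosen only for illustration. The Copeland score follows the common textbook formulation (pairwise wins minus pairwise losses), which may differ in detail from the exact definitions used in the paper.

```python
import numpy as np

# Hypothetical metric values (higher is better), not results from the paper.
# Rows are datasets, columns are algorithms.
algorithms = ["A", "B", "C"]
scores = np.array([
    [0.30, 0.20, 0.10],
    [0.25, 0.30, 0.05],
    [0.40, 0.35, 0.20],
])


def ranks_per_dataset(scores: np.ndarray) -> np.ndarray:
    """Rank algorithms within each dataset (1 = best); ties get the average rank."""
    # better[d, i] = number of algorithms scoring strictly higher than i on dataset d
    better = (scores[:, None, :] > scores[:, :, None]).sum(axis=-1)
    # tied[d, i] = number of other algorithms with exactly the same score as i
    tied = (scores[:, None, :] == scores[:, :, None]).sum(axis=-1) - 1
    return 1.0 + better + 0.5 * tied


def mean_ranks(scores: np.ndarray) -> np.ndarray:
    """Average each algorithm's rank over all datasets (lower is better)."""
    return ranks_per_dataset(scores).mean(axis=0)


def copeland_scores(scores: np.ndarray) -> np.ndarray:
    """Pairwise majority wins minus losses per algorithm (higher is better)."""
    # wins[i, j] = number of datasets on which algorithm i scores higher than j
    wins = (scores[:, :, None] > scores[:, None, :]).sum(axis=0)
    # majority[i, j] is True when i beats j on more datasets than j beats i
    majority = wins > wins.T
    return majority.sum(axis=1) - majority.sum(axis=0)


mr = mean_ranks(scores)
cp = copeland_scores(scores)
for name, r, c in zip(algorithms, mr, cp):
    print(f"{name}: mean rank = {r:.2f}, Copeland score = {c}")
```

On this toy input both schemes order the algorithms the same way (`A`, then `B`, then `C`). With more datasets and closer scores, different aggregation rules can disagree, which is the kind of sensitivity the benchmarking methodology sets out to examine.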
