From Variability to Stability: Advancing RecSys Benchmarking Practices

Valeriy Shevchenko; Nikita Belousov; Alexey Vasilev; Vladimir Zholobov; Artyom Sosedka; Natalia Semenova; Anna Volodkevich; Andrey Savchenko; Alexey Zaytsev

TL;DR

This work tackles the problem that RecSys algorithm evaluation is often biased by dataset choice, proposing a robust benchmarking pipeline built on $30$ open datasets (including two new ones) to compare $11$ collaborative filtering models across $9$ metrics. It systematically analyzes metric aggregation methods (MR, MA, DM-AUC, DM-LBO, Copeland, Minimax) and demonstrates that aggregation choice materially influences rankings, while certain methods (e.g., Geometric/Harmonic means, DM-LBO) show greater stability under perturbations. By linking dataset characteristics to performance and employing clustering to identify principal datasets, the authors show that a compact yet representative benchmarking subset can yield rankings similar to full benchmarks, enabling fairer and more scalable evaluation. The study identifies EASE as a consistently strong performer across aggregations and provides new datasets (MegaMarket, Zvuk) to broaden RecSys evaluation, offering practical guidance for reproducible offline benchmarking in industry and research contexts.
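As an illustration of how two of the aggregation families named above might turn per-dataset scores into one ranking, here is a minimal sketch of Mean Ranks (MR) and a Copeland-style pairwise aggregation. The score table, the dataset names, and the model names other than EASE are made up for illustration, and the exact variants used in the paper (tie handling, normalization) may differ from this sketch.

```python
from itertools import combinations

# Hypothetical scores (higher is better), e.g. one ranking metric per dataset.
# All numbers, dataset names, and the names model_x / model_y are invented.
scores = {
    "dataset_a": {"EASE": 0.30, "model_x": 0.25, "model_y": 0.10},
    "dataset_b": {"EASE": 0.20, "model_x": 0.22, "model_y": 0.15},
    "dataset_c": {"EASE": 0.40, "model_x": 0.35, "model_y": 0.38},
}


def mean_ranks(scores):
    """Average each model's rank (1 = best) over datasets; ties share the average rank."""
    models = sorted(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for per_model in scores.values():
        for m in models:
            better = sum(per_model[o] > per_model[m] for o in models)
            tied = sum(per_model[o] == per_model[m] for o in models)  # counts m itself
            totals[m] += better + (tied + 1) / 2
    return {m: totals[m] / len(scores) for m in models}


def copeland(scores):
    """Copeland score: pairwise majority wins minus pairwise majority losses."""
    models = sorted(next(iter(scores.values())))
    result = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        a_wins = sum(d[a] > d[b] for d in scores.values())
        b_wins = sum(d[b] > d[a] for d in scores.values())
        if a_wins > b_wins:
            result[a] += 1
            result[b] -= 1
        elif b_wins > a_wins:
            result[b] += 1
            result[a] -= 1
    return result


if __name__ == "__main__":
    print("mean ranks:", mean_ranks(scores))
    print("copeland:  ", copeland(scores))
```

Lower is better for mean ranks and higher is better for the Copeland score. On this toy table both agree on the ordering, but the two methods can disagree in general, which is the kind of sensitivity to aggregation choice the TL;DR describes.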

Abstract

In the rapidly evolving domain of Recommender Systems (RecSys), new algorithms frequently claim state-of-the-art performance based on evaluations over a limited set of arbitrarily selected datasets. However, this approach may fail to holistically reflect their effectiveness due to the significant impact of dataset characteristics on algorithm performance. Addressing this deficiency, this paper introduces a novel benchmarking methodology to facilitate a fair and robust comparison of RecSys algorithms, thereby advancing evaluation practices. By utilizing a diverse set of $30$ open datasets, including two introduced in this work, and evaluating $11$ collaborative filtering algorithms across $9$ metrics, we critically examine the influence of dataset characteristics on algorithm performance. We further investigate the feasibility of aggregating outcomes from multiple datasets into a unified ranking. Through rigorous experimental analysis, we validate the reliability of our methodology under the variability of datasets, offering a benchmarking strategy that balances quality and computational demands. This methodology enables a fair yet effective means of evaluating RecSys algorithms, providing valuable guidance for future research endeavors.

Paper Structure (35 sections, 4 equations, 9 figures, 5 tables)

Introduction
Related work
Methodology
Datasets and Preprocessing
Recommendation Models
Evaluation Settings
Metrics Aggregation Methods
Experiments and Results
Metrics
Comparative Analysis of Metrics Aggregation Methods
Considered aggregation approaches
Mean Ranks (MR)
Mean Aggregations
Dolan-Moré Area Under Curve (DM-AUC)
Dolan-Moré leave-best-out (DM-LBO)
...and 20 more sections

Figures (9)

Figure 1: Benchmarking methodology for ranking algorithms. Our main innovations are a curated list of datasets that enables pairwise comparison of models, and aggregation strategies that provide a principled ranking of approaches w.r.t. various criteria.
Figure 2: Spearman correlation between metrics for $k=10$. Darker blue indicates stronger correlations.
Figure 3: Performance profiles for the comparison of RecSys algorithms. The higher the curve, the better the performance of the algorithm. We also provide AUCs for each approach.
Figure 4: The Critical Difference diagram for the comparison of RecSys algorithms. The numbers represent the mean ranks of methods over all datasets. Thick horizontal lines indicate non-significance based on the Wilcoxon-Holm test, while dashed horizontal lines indicate non-significance according to the Bayesian Signed-Rank test.
Figure 5: Stability of aggregations with respect to the number of used datasets.
...and 4 more figures
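The performance profiles and AUCs mentioned in the caption of Figure 3 can be sketched roughly as follows. This is a minimal illustration of a Dolan-Moré style profile, assuming strictly positive higher-is-better scores and a performance ratio defined as the best score on a dataset divided by the model's score. That ratio definition, the threshold grid `taus`, and all names and numbers below are assumptions, not details taken from the paper.

```python
# Hypothetical higher-is-better scores; names and values are invented.
toy_scores = {
    "dataset_1": {"model_a": 0.5, "model_b": 0.25},
    "dataset_2": {"model_a": 0.25, "model_b": 0.25},
}


def performance_profile(scores, taus):
    """For each model, the fraction of datasets with best_score / model_score <= tau."""
    models = sorted(next(iter(scores.values())))
    profiles = {}
    for m in models:
        # Ratio is 1.0 where the model is the best, larger where it lags behind.
        ratios = [max(d.values()) / d[m] for d in scores.values()]
        profiles[m] = [sum(r <= t for r in ratios) / len(ratios) for t in taus]
    return profiles


def area_under_curve(xs, ys):
    """Trapezoidal area under a piecewise-linear curve through (xs, ys)."""
    return sum(
        (ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)
    )


if __name__ == "__main__":
    taus = [1.0, 1.5, 2.0, 2.5]  # assumed grid of ratio thresholds
    for model, curve in performance_profile(toy_scores, taus).items():
        print(model, curve, area_under_curve(taus, curve))
```

A model whose curve reaches 1.0 at small thresholds is close to the best on most datasets, so a larger area under the curve indicates better overall performance, matching the caption's "the higher the curve, the better" reading.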
