Revealing the Hidden Impact of Top-N Metrics on Optimization in Recommender Systems
Lukas Wegmeth, Tobias Vente, Lennart Purucker
TL;DR
This paper investigates whether selecting items other than the top-$n$ during evaluation can improve predictive accuracy in top-$n$ recommender systems, addressing a potential misalignment between training losses and optimization-target metrics. By exhaustively evaluating 252 selection strategies (each choosing $n$ items from the top-$|K|$ items, with $|K|=10$ and $n=5$) across 12 implicit and 8 explicit data sets and 11 algorithms, the study shows that non-top-$n$ strategies can outperform top-$n$ in some cases, but the gains are negligible on average. Statistical analyses indicate that most strategies are not significantly different, and the results generalize well from validation to test, which supports the reliability of the conclusions. The findings suggest there is no practical need to optimize top-$n$ selection strategies, reinforcing the validity of standard evaluation practice and contributing to reproducibility; the work also highlights domain- and algorithm-dependent differences that practitioners should be aware of. The authors provide public code to reproduce the experiments and encourage future work on efficient search methods for near-top selections.
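As a rough illustration of the size of this search space, the sketch below enumerates every way of choosing $n=5$ rank positions from a pool of $|K|=10$. It is not the authors' code; the variable names and the representation of a strategy as a tuple of rank positions are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

K = 10  # size of the candidate pool (the top-|K| items by model score)
n = 5   # number of items each strategy selects

# Assumption: a strategy is a tuple of rank positions, 0 = highest-scored item.
strategies = list(combinations(range(K), n))

# The standard top-n strategy picks positions 0..n-1.
top_n = tuple(range(n))

print(len(strategies))         # 252, i.e. C(10, 5)
print(top_n in strategies)     # True: top-n is one of the enumerated strategies
```

The count of 252 includes the standard top-$n$ selection itself, which is the first tuple that `combinations` yields.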
Abstract
The hyperparameters of recommender systems for top-n predictions are typically optimized to improve the predictive performance of algorithms. The optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, such as nDCG or Precision. In contrast, the optimized algorithm internally optimizes a different loss function during training, such as squared error or cross-entropy. To tackle this discrepancy, recent work has focused on generating loss functions better suited to recommender systems. Yet, when an algorithm is evaluated with a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric, ignoring that these items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for the optimization-target metric could lie outside the top-n recommended items, with a hidden impact on optimization performance. We therefore analyze whether the top-n items are optimal for optimization-target top-n metrics. To answer this question, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top ~43% of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions.
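To make the discrepancy concrete, the following minimal sketch scores two selection strategies with nDCG for a single toy user. The relevance vector, the function names, and the simplification that all of the user's relevant items fall within the top 10 are assumptions for illustration only; they are not taken from the paper's experiments.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a list ordered by display position."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_of_selection(ranked_relevance, positions):
    """nDCG of the items found at the given rank positions.

    ranked_relevance: binary relevance of the candidates, ordered by model score.
    positions: rank positions the strategy selects, kept in rank order.
    Assumption: every relevant item of the user is among the candidates.
    """
    selected = [ranked_relevance[p] for p in positions]
    ideal = sorted(ranked_relevance, reverse=True)[:len(positions)]
    ideal_dcg = dcg(ideal)
    return dcg(selected) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance of one user's top-10 items (1 = relevant).
ranked_relevance = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

top_n = (0, 1, 2, 3, 4)        # standard top-5 selection
alternative = (0, 1, 3, 5, 6)  # another 5-of-10 selection

ndcg_top = ndcg_of_selection(ranked_relevance, top_n)
ndcg_alt = ndcg_of_selection(ranked_relevance, alternative)
print(ndcg_top, ndcg_alt)
```

For this one user the alternative selection scores higher than top-5, but that says nothing about averages: a selection strategy is a fixed set of positions applied to every user, so whether it helps overall is exactly the empirical question the study evaluates.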
