Rank-Preference Consistency as the Appropriate Metric for Recommender Systems

Tung Nguyen, Jeffrey Uhlmann

TL;DR

This work questions the primacy of RMSE/MAE as RS performance metrics, arguing they do not directly reflect users' rank-order preferences. It introduces rank-preference consistency, measured by discordant prediction pairs, as a more natural criterion and evaluates UC/SC, SVD variants, and GLocalK on benchmark datasets. Empirical results show UC/SC and GLocalK achieve higher rank-consistency than SVD variants, suggesting that unitary-invariant metrics can be misleading for RS evaluation. The findings advocate for evaluation and design prioritizing rank-order consistency, with consistency-based methods offering transparent guarantees absent in many AI-based approaches.

Abstract

In this paper, we argue that conventional unitary-invariant measures of recommender system (RS) performance, which are based on differences between predicted ratings and actual user ratings, fail to assess fundamental RS properties. More specifically, posing the optimization problem as one of predicting exact user ratings provides only an indirect, suboptimal approximation of what RS applications typically need: the ability to accurately predict user preferences. We argue that scalar measures such as RMSE and MAE, computed on differences between actual and predicted ratings, are only proxies for an RS's ability to accurately estimate user preferences. We propose a measure that we consider more fundamentally appropriate for assessing RS performance, rank-preference consistency, which simply counts the number of prediction pairs that are inconsistent with the user's expressed product preferences. For example, if an RS predicts that the user will prefer product A over product B, but the user's withheld ratings indicate a preference for product B over A, then rank-preference consistency has been violated. Our test results conclusively demonstrate that methods tailored to optimize arbitrary measures such as RMSE are not generally effective at accurately predicting user preferences. Thus, we conclude that conventional methods used for assessing RS performance are arbitrary and misleading.
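To make the counting idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `discordant_pairs` and `rmse`, the toy ratings, and the handling of ties are all assumptions made for illustration: pairs with equal actual ratings are skipped because they express no preference, and predicted ties are not counted as violations. The paper may treat these cases differently.

```python
from itertools import combinations
from math import sqrt


def discordant_pairs(actual, predicted):
    """Count item pairs whose predicted order contradicts the actual order.

    Assumptions (not taken from the paper): pairs with equal actual
    ratings are skipped, and predicted ties are not counted as violations.
    """
    count = 0
    for i, j in combinations(range(len(actual)), 2):
        actual_diff = actual[i] - actual[j]
        predicted_diff = predicted[i] - predicted[j]
        if actual_diff == 0:
            continue  # the user expressed no preference between i and j
        if actual_diff * predicted_diff < 0:
            count += 1  # predicted order is the reverse of the actual order
    return count


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted ratings."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical withheld ratings for one user over three products.
actual = [5, 4, 3]
# Predictor A: small numeric errors, but it swaps the first two products.
pred_a = [4.4, 4.5, 3.0]
# Predictor B: every rating is one point too low, but the order is preserved.
pred_b = [4.0, 3.0, 2.0]

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, "RMSE:", round(rmse(actual, pred), 3),
          "discordant pairs:", discordant_pairs(actual, pred))
```

In this toy case, predictor A has the lower RMSE yet reverses one of the user's preferences, while predictor B has the higher RMSE and reverses none. This is the kind of mismatch between rating error and preference order that the abstract describes.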

Paper Structure

This paper contains 7 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Distinguishing rating 5 from rating 1. (Left) ML-1M dataset, (Right) Douban dataset
  • Figure 2: Distinguishing rating 5 from rating 2. (Left) ML-1M dataset, (Right) Douban dataset
  • Figure 3: Distinguishing rating 5 from rating 3. (Left) ML-1M dataset, (Right) Douban dataset
  • Figure 4: Distinguishing rating 5 from rating 4. (Left) ML-1M dataset, (Right) Douban dataset