Calibrating the Predictions for Top-N Recommendations

Masahiro Sato

TL;DR

The paper addresses the problem that calibrated predictions in recommender systems can be miscalibrated on the top-N items even when calibration measured over all items appears good. It introduces $ECE@N$ and the rank-discounted $RDECE@N$ to evaluate calibration quality specifically among top-N items, and it proposes a generic top-N focused calibration optimization (TNF) that groups top-N items by rank and trains a separate calibration model for each group with rank-dependent weights. Experiments show that TNF reduces calibration errors on explicit and implicit feedback datasets across a variety of recommender and calibration models, whereas baselines trained on all items and recent debiasing approaches may fail or underperform. The findings highlight the importance of rank-aware calibration for top-N recommendations and provide a practical framework for improving the reliability of top-N predictions in real-world systems.

Abstract

Well-calibrated predictions of user preferences are essential for many applications. Since recommender systems typically select the top-N items for users, calibration for those top-N items, rather than for all items, is important. We show that previous calibration methods result in miscalibrated predictions for the top-N items, despite their excellent calibration performance when evaluated on all items. In this work, we address the miscalibration in the top-N recommended items. We first define evaluation metrics for this objective and then propose a generic method to optimize calibration models focusing on the top-N items. It groups the top-N items by their ranks and optimizes distinct calibration models for each group with rank-dependent training weights. We verify the effectiveness of the proposed method for both explicit and implicit feedback datasets, using diverse classes of recommender models.
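
To make the evaluation idea concrete, here is a minimal sketch of a binned expected calibration error restricted to the top-N items. It is an illustration under stated assumptions, not the paper's exact definition of $ECE@N$ or $RDECE@N$: the function name `ece_at_n`, the equal-width binning, and the DCG-style discount `1 / log2(rank + 1)` are all hypothetical choices made for this example.

```python
import math

def ece_at_n(probs, labels, ranks, n, n_bins=10, discounted=False):
    """Binned calibration error over items ranked in the top n.

    probs      : calibrated probabilities in [0, 1]
    labels     : binary outcomes (1 = positive feedback)
    ranks      : 1-based rank of each item in its user's list
    discounted : if True, weight each item by 1 / log2(rank + 1)
                 (an assumed discount, chosen only for illustration)
    """
    weight_sum = [0.0] * n_bins
    prob_sum = [0.0] * n_bins
    label_sum = [0.0] * n_bins
    for p, y, r in zip(probs, labels, ranks):
        if r > n:
            continue  # items outside the top n do not contribute
        w = 1.0 / math.log2(r + 1) if discounted else 1.0
        b = min(int(p * n_bins), n_bins - 1)  # equal-width probability bin
        weight_sum[b] += w
        prob_sum[b] += w * p
        label_sum[b] += w * y
    total = sum(weight_sum)
    if total == 0.0:
        return 0.0
    # Sum over bins of (bin weight share) * |mean label - mean prediction|,
    # which simplifies to |label_sum - prob_sum| / total per bin.
    return sum(abs(l - p) for l, p in zip(label_sum, prob_sum)) / total

probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 1]
ranks = [1, 2, 3, 4]
top2 = ece_at_n(probs, labels, ranks, n=2)
top4 = ece_at_n(probs, labels, ranks, n=4)
top4_discounted = ece_at_n(probs, labels, ranks, n=4, discounted=True)
```

In this toy data the error over the top 2 items differs from the error over the top 4, which is the kind of gap a top-N metric is meant to expose. With the discount enabled, errors at lower ranks count for less.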

Paper Structure (14 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Calibration plots for NCF with Gaussian calibration applied to preference prediction in the KuaiRec dataset.
  • Figure 2: Proposed calibration method. Top-N items are grouped by their ranks. Then calibration models for each ranking group are trained with weights that decrease with the ranks of each training sample.
  • Figure 3: ECE@N for varied number of recommendations.
  • Figure 4: Sensitivity to the discounting factor $\alpha$ and the number of groups $n_g$. The recommender and calibration models are ItemKNN with isotonic regression on ML-1M, and NCF with Beta calibration on KuaiRec.
  • Figure 5: ECE@N for varied N.
  • ...and 1 more figures
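
The procedure described in the Figure 2 caption (group top-N items by rank, then train one calibration model per group with weights that decrease with rank) can be sketched as follows. This is a simplified illustration, not the paper's implementation: histogram binning stands in for the calibration model, and the equal-size rank groups, the weight form `rank ** -alpha`, and the names `fit_rank_grouped_binning` and `calibrate` are assumptions introduced here.

```python
from collections import defaultdict

def fit_rank_grouped_binning(scores, labels, ranks, n=10, n_groups=2,
                             n_bins=5, alpha=1.0):
    """Fit one histogram-binning calibrator per rank group on top-n items.

    scores : raw model scores, assumed to lie in [0, 1]
    labels : binary outcomes
    ranks  : 1-based ranks; items with rank > n are ignored
    alpha  : assumed discount exponent; larger values down-weight
             lower-ranked training samples more strongly
    """
    group_size = -(-n // n_groups)  # ceiling division
    pos = defaultdict(float)
    tot = defaultdict(float)
    for s, y, r in zip(scores, labels, ranks):
        if r > n:
            continue
        g = (r - 1) // group_size             # rank group index
        b = min(int(s * n_bins), n_bins - 1)  # score bin index
        w = r ** -alpha                       # rank-dependent weight
        pos[(g, b)] += w * y
        tot[(g, b)] += w
    table = {key: pos[key] / tot[key] for key in tot}
    return table, group_size, n_bins

def calibrate(model, score, rank):
    """Map a raw score to the weighted positive rate of its (group, bin)."""
    table, group_size, n_bins = model
    key = ((rank - 1) // group_size, min(int(score * n_bins), n_bins - 1))
    return table.get(key, score)  # fall back to the raw score if unseen

scores = [0.8, 0.8, 0.8, 0.8, 0.8]
labels = [1, 0, 0, 0, 1]
ranks = [1, 2, 3, 4, 5]
model = fit_rank_grouped_binning(scores, labels, ranks, n=4, n_groups=2)
p_rank1 = calibrate(model, 0.8, rank=1)
p_rank3 = calibrate(model, 0.8, rank=3)
```

Here the same raw score of 0.8 is mapped to different calibrated values depending on the rank group, and the item at rank 5 is excluded from training because it falls outside the top 4.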