On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-$n$ Recommendation

Olivier Jeunen; Ivan Potapov; Aleksei Ustimenko

TL;DR

This work interrogates the use of (normalised) Discounted Cumulative Gain ($DCG$, $nDCG$) as an offline evaluation metric for top-$n$ recommender systems. It formalises when $DCG$ acts as an unbiased estimator of online reward within a counterfactual framework and demonstrates that normalising to $nDCG$ destroys this property, potentially reversing model rankings. Through large-scale offline–online experiments, the authors show that $DCG$ correlates strongly with online reward and offers higher statistical sensitivity to detect improvements, whereas $nDCG$ exhibits weak or negative correlations and reduced sensitivity. The results advocate using $DCG$ (with appropriate variance-control techniques like clipping) for offline evaluation and model selection, while highlighting the need to revisit the assumptions behind these metrics and to explore extensions to broader online metrics and biases.

Abstract

Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-$n$ recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited.
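The order inversion described in the abstract can be illustrated with a small toy example. This is a minimal sketch, not the paper's estimator: it assumes the traditional logarithmic discount $1/\log_2(\text{rank}+1)$ and binary relevance, and the two users, the two policies, and all names (`dcg`, `ndcg`, `policy_a`, `policy_b`) are hypothetical.

```python
import math


def dcg(relevances):
    """DCG of a ranked list, using the traditional 1/log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, ideal_relevances):
    """DCG divided by the DCG of the ideal ranking for the same sample."""
    ideal = dcg(ideal_relevances)
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical data: user u1 has three relevant items, user u2 has one.
ideal = {"u1": [1, 1, 1], "u2": [1, 0, 0]}

# Relevance of the top-3 list each (hypothetical) policy shows to each user.
policy_a = {"u1": [1, 1, 1], "u2": [0, 0, 0]}
policy_b = {"u1": [1, 0, 0], "u2": [1, 0, 0]}

mean_dcg_a = mean(dcg(policy_a[u]) for u in ideal)
mean_dcg_b = mean(dcg(policy_b[u]) for u in ideal)
mean_ndcg_a = mean(ndcg(policy_a[u], ideal[u]) for u in ideal)
mean_ndcg_b = mean(ndcg(policy_b[u], ideal[u]) for u in ideal)

# Policy A has the higher mean DCG, policy B the higher mean nDCG.
print(f"mean DCG : A={mean_dcg_a:.4f}  B={mean_dcg_b:.4f}")
print(f"mean nDCG: A={mean_ndcg_a:.4f}  B={mean_ndcg_b:.4f}")
```

In this toy setup the per-user normaliser differs between users, so averaging normalised values weights users differently from averaging raw DCG, and the two averages disagree on which policy is better.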

Paper Structure (11 sections, 2 theorems, 11 equations, 2 figures, 2 tables)

Introduction & Motivation
Background & Related Work
Formalising the problem setting
Discounted Cumulative Gain as an unbiased offline evaluation metric
Normalising DCG is Inconsistent
Experimental Results & Discussion
Offline–Online Metric Correlation (RQ1–3)
Offline–Online Metric Sensitivity (RQ4)
Perspectives going forward
Conclusions & Outlook
Empirical Evidence of (n)DCG Inconsistency on Public Data

Key Result

lemma 1

The Discounted Cumulative Gain (DCG) and Normalised Discounted Cumulative Gain (nDCG) metrics yield consistent relative orders over a competing set of policies $\Omega$ that are being evaluated for a single sample $\bm{x}$. That is,
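
One way to read this lemma, sketched under the usual assumption that nDCG divides DCG by an ideal value $\mathrm{iDCG}(\bm{x}) > 0$ that depends on the sample $\bm{x}$ but not on the policy. The symbols $\mathrm{iDCG}$, $\pi_a$ and $\pi_b$ are notation assumed here for illustration, not taken from the lemma:

$$
\mathrm{nDCG}(\pi, \bm{x}) = \frac{\mathrm{DCG}(\pi, \bm{x})}{\mathrm{iDCG}(\bm{x})}
\quad\Longrightarrow\quad
\Big(\mathrm{DCG}(\pi_a, \bm{x}) \geq \mathrm{DCG}(\pi_b, \bm{x})
\iff
\mathrm{nDCG}(\pi_a, \bm{x}) \geq \mathrm{nDCG}(\pi_b, \bm{x})\Big)
\quad \forall\, \pi_a, \pi_b \in \Omega .
$$

Dividing by a positive constant preserves the order for one sample. Across several samples the divisor varies, which is where an averaged nDCG can diverge from an averaged DCG.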

Figures (2)

Figure 1: Sensitivity measures (y-axis) of (n)DCG for varying values of the capping parameter in IPS (x-axis).
Figure 2: DCG and nDCG exhibit significant disagreement for a standard offline evaluation setup on MovieLens-1M.
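
As a rough illustration of what a capping parameter in IPS does, the sketch below clips each importance weight at a threshold before averaging. It assumes the common capped form $\min(\text{cap},\, \pi_{\text{target}} / \pi_{\text{logging}})$, which may differ from the exact estimator used in the paper; the function name, its arguments, and the data are hypothetical.

```python
import math


def capped_ips_estimate(rewards, target_probs, logging_probs, cap):
    """Average reward reweighted by importance weights clipped at `cap`.

    Assumed form: weight = min(cap, target_prob / logging_prob).
    """
    total = 0.0
    for reward, p_target, p_logging in zip(rewards, target_probs, logging_probs):
        weight = min(cap, p_target / p_logging)
        total += weight * reward
    return total / len(rewards)


# Hypothetical logged interactions: observed rewards plus the probability
# of the logged action under the target and the logging policy.
rewards = [1.0, 0.0, 1.0]
target_probs = [0.5, 0.2, 0.9]
logging_probs = [0.1, 0.4, 0.3]

uncapped = capped_ips_estimate(rewards, target_probs, logging_probs, cap=math.inf)
capped = capped_ips_estimate(rewards, target_probs, logging_probs, cap=2.0)

print(f"uncapped: {uncapped:.4f}")
print(f"capped at 2: {capped:.4f}")
```

A lower cap bounds the influence of any single logged interaction, which reduces variance at the cost of introducing bias.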

Theorems & Definitions (2)

lemma 1
lemma 2
