Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation

Chengbing Wang; Wentao Shi; Jizhi Zhang; Wenjie Wang; Hang Pan; Fuli Feng

TL;DR

This work addresses the unreliability of evaluating debiasing methods for recommender systems when using randomly-exposed data, especially for small Recall@K. It reveals theoretical and empirical gaps between Recall@$K$ on fully-exposed data and Recall@$\overline{K}$ on randomly-exposed data, highlighting the insufficiency of traditional evaluation schemes. The authors introduce Unbiased Recall Evaluation (URE), which uses the randomly-exposed data to produce an unbiased estimate of Recall@$K$ on fully-exposed data by thresholding on the $(K+1)$-th item and averaging positive-rate ratios across users, with a formal proof of unbiasedness. Extensive experiments on KuaiRec and Yahoo!R3 demonstrate that URE's estimates align with true Recall@$K$ on full data and that traditional schemes can mislead conclusions about debiasing methods, providing a practical path toward more reliable model evaluation in debiasing research.
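The thresholding idea summarized above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`recall_at_k_full`, `ure_recall_at_k`, `ure_mean`), the per-user array layout, and the tie handling are all assumptions made for illustration. It assumes the model can score every item for a user, takes the score of the $(K+1)$-th ranked item as a threshold, and reports the share of randomly-exposed positives that score above it.

```python
import numpy as np


def recall_at_k_full(scores, labels, k):
    """Recall@K on fully-exposed data: share of all positives ranked in the top K."""
    top_k = np.argsort(-scores)[:k]
    return labels[top_k].sum() / labels.sum()


def ure_recall_at_k(scores, rand_idx, rand_labels, k):
    """Hypothetical URE-style estimate of Recall@K for one user.

    scores      : model scores for every item (labels are not needed here)
    rand_idx    : indices of the randomly-exposed items
    rand_labels : 0/1 labels observed for those items
    Assumes scores have no ties around the threshold.
    """
    # Score of the (K+1)-th ranked item over the full item set.
    threshold = np.sort(scores)[::-1][k]
    positives = rand_labels == 1
    if positives.sum() == 0:
        return np.nan  # no exposed positives: this user contributes nothing
    above = scores[rand_idx] > threshold
    return (above & positives).sum() / positives.sum()


def ure_mean(per_user_estimates):
    """Average the per-user ratios, skipping users without exposed positives."""
    return float(np.nanmean(per_user_estimates))


# Toy user: six items, three positives.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1, 0, 1, 0, 1, 0])

true_recall = recall_at_k_full(scores, labels, k=2)
# If every item were exposed, the thresholded ratio matches Recall@K exactly.
all_exposed = ure_recall_at_k(scores, np.arange(6), labels, k=2)
# With only items 0, 2 and 3 exposed, the ratio is computed on that subset.
subset = ure_recall_at_k(scores, np.array([0, 2, 3]), labels[[0, 2, 3]], k=2)
```

In this toy case the estimate on the full item set coincides with Recall@$K$, while the estimate on a subset varies with which items happen to be exposed; the paper's claim of unbiasedness concerns the expectation over such random exposures, which this sketch does not verify.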

Abstract

Recent work has improved recommendation models remarkably by equipping them with debiasing methods. Due to the unavailability of fully-exposed datasets, most existing approaches resort to randomly-exposed datasets as a proxy for evaluating debiased models, employing the traditional evaluation scheme to represent the recommendation performance. However, in this study, we reveal that the traditional evaluation scheme is not suitable for randomly-exposed datasets, leading to inconsistency between the Recall performance obtained using randomly-exposed datasets and that obtained using fully-exposed datasets. Such inconsistency indicates the potential unreliability of experimental conclusions on previous debiasing techniques and calls for unbiased Recall evaluation using randomly-exposed datasets. To bridge the gap, we propose the Unbiased Recall Evaluation (URE) scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance on fully-exposed datasets. We provide theoretical evidence to demonstrate the rationality of URE and perform extensive experiments on real-world datasets to validate its soundness.

Paper Structure (8 sections, 2 theorems, 3 equations, 2 figures, 2 tables)

Introduction
Preliminary
Relation of $\text{Recall@}K$ and $\text{Recall@}\overline{K}$
Theoretical Guarantee
Empirical Results
URE scheme
Experiments
Conclusion

Key Result

Theorem 1

Assume that $D_{full}$ contains $N^{+}$ positive samples and $N^{-}$ negative samples, for a total sample size of $N=N^{+}+N^{-}$, and that $D_{rand}$ draws $\overline{N}$ samples from $D_{full}$. We denote the set of all combinations of $D_{rand}$ of size $\overline{N}$ as $\mathcal{S}_{\overline{N}}$ … where $E$ denotes the expectation function.

Figures (2)

Figure 1: The correlation coefficients between Recall@$\overline{K}$ on $D_{rand}$ and Recall@$K$ on $D_{full}$. (a) The effect of $\overline{N}$, where $\overline{K}$ is fixed to 5. (b) The effect of $\overline{K}$, where $\overline{N}$ is fixed to 80.
Figure 3: (a) The correlation coefficients between Recall@$K$ and $\widehat{\text{Recall@}K}$ as well as the conventional Recall@$\overline{K}$. (b) The correlation coefficients between Recall@$K$ and $\widehat{\text{Recall@}K}$ for different values of $K$.

Theorems & Definitions (2)

Theorem 1
Theorem 2
