Mitigating Exposure Bias in Online Learning to Rank Recommendation: A Novel Reward Model for Cascading Bandits

Masoud Mansoury; Bamshad Mobasher; Herke van Hoof

TL;DR

Exposure bias in online learning-to-rank causes persistent over-exposure of a subset of items. The authors address this with an Exposure-Aware reward model integrated into Linear Cascading Bandits, using position-aware rewards and a penalization term to encourage exploration of under-exposed items; learning proceeds via ridge regression with a UCB-based item selection. Theoretical guarantees are established, including a high-probability regret bound $R(n) \le 2\alpha K\sqrt{\frac{dn\log\left(1+\frac{nK}{d\sigma^2}\right)}{\log\left(1+\frac{1}{\sigma^2}\right)}}+1$, yielding $R(n)=\mathcal{O}(dK\sqrt{n})$ under suitable parameters, matching the baseline. Empirically, on MovieLens 1M and Yahoo Music, the Exposure-Aware Cascading Bandits reduce exposure bias more effectively than baselines while maintaining or improving accuracy, with weight-function variations (logarithmic, linear, exponential) providing flexibility. The work advances fair, scalable online ranking and can extend to other online bandit frameworks, including Thompson Sampling-based approaches.
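The learning loop mentioned above (ridge regression plus UCB-based item selection) can be illustrated with a minimal pure-Python sketch in the style of linear cascading bandits. It is not the authors' implementation: the class name `LinearCascadeUCB`, its method names, and the defaults for `alpha` and `sigma` are assumptions made for illustration, and the reward passed to `update` is left to the caller.

```python
import math


def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LinearCascadeUCB:
    """Hypothetical sketch: ridge-regression estimate plus a UCB bonus."""

    def __init__(self, d, alpha=0.25, sigma=1.0):
        self.alpha = alpha            # degree of exploration
        self.sigma2 = sigma ** 2
        # Inverse of the regularised Gram matrix M = I + sigma^-2 * sum(x x^T).
        self.M_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d            # sum of reward-weighted feature vectors

    def theta(self):
        # Ridge estimate: theta_hat = sigma^-2 * M^-1 * b.
        return [v / self.sigma2 for v in mat_vec(self.M_inv, self.b)]

    def ucb(self, x):
        # Estimated attraction plus alpha times the confidence width.
        width = math.sqrt(max(dot(x, mat_vec(self.M_inv, x)), 0.0))
        return dot(x, self.theta()) + self.alpha * width

    def recommend(self, items, K):
        # items maps item id -> feature vector; return the K highest-UCB ids.
        return sorted(items, key=lambda e: self.ucb(items[e]), reverse=True)[:K]

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of M^-1 for M += x x^T / sigma^2.
        Mx = mat_vec(self.M_inv, x)
        denom = self.sigma2 + dot(x, Mx)
        for i in range(len(x)):
            for j in range(len(x)):
                self.M_inv[i][j] -= Mx[i] * Mx[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

In a cascading setting, `update` would be called once for each examined item of the recommended list, up to and including the clicked position. In the exposure-aware variant the `reward` argument would depend on the item's position rather than being a plain 0/1 click indicator.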

Abstract

Exposure bias is a well-known issue in recommender systems where items and suppliers are not equally represented in the recommendation results. This bias becomes particularly problematic over time as a few items are repeatedly over-represented in recommendation lists, leading to a feedback loop that further amplifies this bias. Although extensive research has addressed this issue in model-based or neighborhood-based recommendation algorithms, less attention has been paid to online recommendation models, such as those based on top-K contextual bandits, where recommendation models are dynamically updated with ongoing user feedback. In this paper, we study exposure bias in a class of well-known contextual bandit algorithms known as Linear Cascading Bandits. We analyze these algorithms in their ability to handle exposure bias and provide a fair representation of items in the recommendation results. Our analysis reveals that these algorithms fail to mitigate exposure bias in the long run during the course of ongoing user interactions. We propose an Exposure-Aware reward model that updates the model parameters based on two factors: 1) implicit user feedback and 2) the position of the item in the recommendation list. The proposed model mitigates exposure bias by controlling the utility assigned to the items based on their exposure in the recommendation list. Our experiments with two real-world datasets show that our proposed reward model improves the exposure fairness of the linear cascading bandits over time while maintaining the recommendation accuracy. It also outperforms the current baselines. Finally, we prove a high probability upper regret bound for our proposed model, providing theoretical guarantees for its performance.
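To make the two-factor reward idea concrete, the following is a rough sketch of how a reward could combine click feedback with list position. The summary does not give the paper's functional forms, so everything here is an assumption: the name `exposure_aware_rewards`, the three placeholder weight functions, the normalisation by the weight of the last position, the choice that a click lower in the list earns a larger reward, and the flat penalty `gamma` for examined but unclicked items.

```python
import math

# Placeholder weight functions of the 1-based position k (assumed forms,
# not taken from the paper).
WEIGHT_FUNCTIONS = {
    "logarithmic": lambda k: math.log(1 + k),
    "linear": lambda k: float(k),
    "exponential": lambda k: math.exp(k),
}


def exposure_aware_rewards(click_pos, K, weight="logarithmic", gamma=0.0):
    """Return one reward per examined position of a K-item list.

    click_pos is the 1-based clicked position, or None when nothing is clicked.
    Under the cascade assumption the user examines positions 1..click_pos
    (or all K positions when there is no click).
    """
    w = WEIGHT_FUNCTIONS[weight]
    last_examined = K if click_pos is None else click_pos
    penalty = -gamma if gamma > 0 else 0.0
    rewards = []
    for k in range(1, last_examined + 1):
        if k == click_pos:
            # Assumed: position-dependent reward, normalised to (0, 1].
            rewards.append(w(k) / w(K))
        else:
            # Assumed: flat penalty for an examined but unclicked item.
            rewards.append(penalty)
    return rewards
```

With `gamma=0`, unclicked items receive zero reward, as in a standard cascading bandit, and only the clicked item's reward varies with position.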

Paper Structure

This paper contains 17 sections, 1 theorem, 26 equations, 5 figures, 4 tables, and 1 algorithm.

Introduction
Background
Cascading bandit
Measuring exposure fairness
Exposure-Aware Cascading Bandits
Algorithm for learning EACB
Analysis of regret upper-bound
Proof of Theorem 1
Experiments
Datasets
Evaluation metrics and baselines
Simulation and experimental setup
Results
(RQ1) The effect of exploration degree on exposure bias in LinUCB
(RQ2) Comparison to baselines
...and 2 more sections

Key Result

Theorem 1

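For any $\sigma>0$, $\left\lVert\theta^*\right\rVert_2\leq1$, and a suitable choice of the exploration parameter $\alpha$, the $n$-step regret satisfies, with high probability,

$$R(n) \le 2\alpha K\sqrt{\frac{dn\log\left(1+\frac{nK}{d\sigma^2}\right)}{\log\left(1+\frac{1}{\sigma^2}\right)}}+1.$$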
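As a quick numerical illustration, the high-probability bound quoted in the TL;DR can be evaluated directly. The function name `regret_upper_bound` and the example values are assumptions. The number is only meaningful when $\alpha$ satisfies the theorem's condition, which this summary does not reproduce.

```python
import math


def regret_upper_bound(n, K, d, alpha, sigma=1.0):
    """Evaluate 2*alpha*K*sqrt(d*n*log(1 + n*K/(d*sigma^2)) / log(1 + 1/sigma^2)) + 1."""
    log_num = math.log(1 + n * K / (d * sigma ** 2))
    log_den = math.log(1 + 1 / sigma ** 2)
    return 2 * alpha * K * math.sqrt(d * n * log_num / log_den) + 1


# Example with the d=10, K=10 setting used in the figures (alpha chosen arbitrarily).
bound = regret_upper_bound(n=10_000, K=10, d=10, alpha=0.25)
```

The bound grows as $\sqrt{n}$ up to a logarithmic factor, and apart from the additive 1 it scales linearly with $\alpha$.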

Figures (5)

Figure 1: Reward distribution of CB and EACB for different weight functions when click is observed at varying positions in the list.
Figure 2: The effect of varying the degree of exploration, $\alpha \in \{0.25,0.75,1,2,5\}$, on the performance of LinUCB in terms of clicks and exposure bias on the MovieLens dataset for $d=10$ and $K=10$. The left plot shows the average exploration across all users at each round, where exploration is computed using the second term of the UCB equation. Right plots: (a) number of observed clicks in each round, (b) $n$-step regret as defined in the regret equation, (c-f) fairness metrics computed on accumulated exposure values at each round.
Figure 3: Comparison of LinUCB and EALinUCB with three weight functions in terms of Equality$^{(P)}$ per round for $d=10$, $K=10$, $\alpha=0.25$, and $\gamma=0$. At each round $t$, Equality$^{(P)}$ is computed over the accumulated exposure up to round $t$.
Figure 4: Exposure analysis of our EALinUCB with three different weight functions for $d=10$ and $K=10$. The colorbar shows the percentage increase/decrease in $\overline{clicks}$. Items are sorted by their exposure ($E^{(P)}$) under LinUCB in descending order from left to right, so items on the left are the over-exposed ones and items on the right are the under-exposed ones.
Figure 5: Performance of our EALinUCB with three different weight functions in terms of $\overline{clicks}$ and fairness metrics for varying $\gamma \in \{0,0.001,0.005,0.01,0.05,0.1,0.2\}$ on the MovieLens dataset for $d=10$ and $K=10$. The cross shows the performance of LinUCB.

Theorems & Definitions (1)

Theorem 1
