Adaptively Learning to Select-Rank in Online Platforms

Jingyuan Wang; Perry Dong; Ying Jin; Ruohan Zhan; Zhengyuan Zhou

TL;DR

This research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience, and develops a user response model that considers diverse user preferences and the varying effects of item positions.

Abstract

Ranking algorithms are fundamental to online platforms ranging from e-commerce sites to content streaming services. Our research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience. We develop a user response model that considers diverse user preferences and the varying effects of item positions, aiming to optimize overall user satisfaction with the ranked list. We frame this problem within a contextual bandits framework, with each ranked list as an action. Our approach incorporates an upper confidence bound to adjust predicted user satisfaction scores and selects the ranking action that maximizes these adjusted scores, which is solved efficiently via maximum weight imperfect matching. We demonstrate that our algorithm achieves a cumulative regret bound of $O(d\sqrt{NKT})$ for ranking $K$ out of $N$ items in a $d$-dimensional context space over $T$ rounds, under the assumption that user responses follow a generalized linear model. This regret bound alleviates dependence on the ambient action space, whose cardinality grows exponentially with $N$ and $K$ (thus rendering direct application of existing adaptive learning algorithms, such as UCB or Thompson sampling, infeasible). Experiments conducted on both simulated and real-world datasets demonstrate that our algorithm outperforms the baseline.
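The matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical score matrix `score[k][i]` that already holds the upper-confidence-adjusted satisfaction for showing item `i` at position `k`; how the paper builds those scores is not reproduced here. The function name `max_weight_ranking` and the use of a textbook Hungarian method are illustrative choices, not the authors' implementation.

```python
from typing import List


def max_weight_ranking(score: List[List[float]]) -> List[int]:
    """Return ranking[k] = index of the item placed at position k.

    score has K rows (positions) and N >= K columns (items). Each position
    gets exactly one item and N - K items stay unmatched, i.e. the matching
    is imperfect on the item side. Scores are negated so that the classic
    min-cost Hungarian method maximizes the total score.
    """
    K, N = len(score), len(score[0])
    INF = float("inf")
    u = [0.0] * (K + 1)      # dual potentials for positions (1-based)
    v = [0.0] * (N + 1)      # dual potentials for items (1-based)
    match = [0] * (N + 1)    # match[j] = position assigned to item j, 0 if none
    way = [0] * (N + 1)      # back-pointers along the augmenting path

    for k in range(1, K + 1):
        match[0] = k
        j0 = 0
        minv = [INF] * (N + 1)
        used = [False] * (N + 1)
        while True:
            used[j0] = True
            k0 = match[j0]
            delta, j1 = INF, 0
            for j in range(1, N + 1):
                if not used[j]:
                    cur = -score[k0 - 1][j - 1] - u[k0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(N + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:   # reached a free item: augmenting path found
                break
        while True:              # flip assignments along the path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
            if j0 == 0:
                break

    ranking = [0] * K
    for j in range(1, N + 1):
        if match[j]:
            ranking[match[j] - 1] = j - 1
    return ranking


# Toy example with K = 2 positions and N = 3 items (made-up scores).
toy_scores = [[0.9, 0.8, 0.1],
              [0.7, 0.2, 0.3]]
toy_ranking = max_weight_ranking(toy_scores)
```

In this toy case, filling positions one at a time with the best remaining item would put item 0 first and reach a total of 1.2, whereas the matching places item 1 first and item 0 second for a total of 1.5. The matching runs in time polynomial in $N$ and $K$, so it never enumerates the ranked lists themselves.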

Paper Structure (27 sections, 11 theorems, 110 equations, 3 figures, 1 table, 5 algorithms)

Introduction
Related Works
Problem Setup
User Satisfaction Model
Reward and Outcome Structure
Upper Confidence Ranking: Adaptive Learning-to-Rank Algorithm
Constructing Upper Confidence Bounds
Upper Confidence Ranking via Maximum Weighted Bipartite Matching
Main Result on Cumulative Regret
Proof sketch of Theorem \ref{thm:n-choose-k-regret}
Experiments
Empirical Results
UCR consistently outperforms the G-MLE approach across different environments.
UCR maintains its advantage over G-MLE on real-world applications.
Additional Experiment Details
...and 12 more sections

Key Result

Theorem 4.1

Fix any $\delta\in(0,1)$, and let $c_1:=\min\{\frac{1}{2K},c_x\}>0$. Suppose Assumptions \ref{assump:context} and \ref{assump:regularity} hold, and $T_0 \geq \max\{ (\frac{16}{3c_1}+\frac{32(K+N)^2}{N^2c_1})\log\frac{2(d+1)}{\delta} , ~\frac{6\bar{\sigma}^2}{c_1\kappa^2} ( (d+1) \log(1+2T/d) + \log(1/\delta) ) \}$, where $\bar{c}=\max_{k\in[K]}c_k$. With the proper choice of the initialization phase $T_0$,

Figures (3)

Figure 1: The average cumulative regret (with standard deviation interval) of UCR and G-MLE in the simulated environment. Left: the $N=7,K=5$ case; middle: the $N=10,K=5$ case; right: the $N=K=5$ case.
Figure 2: Average relative regret (with standard deviation interval) of UCR and G-MLE on the real-world dataset.
Figure 3: Visualization of Bipartite Matching of UCR and greedy MLE approach.
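
To give a sense of scale for the three simulated settings in Figure 1, the short sketch below counts the ranked lists (ordered selections of $K$ out of $N$ items, i.e. $N!/(N-K)!$) that a bandit treating each list as a separate arm would face. The helper name `num_ranked_lists` is hypothetical, and the comparison with the $N \cdot K$ position-item pairs is an illustration rather than a statement from the paper.

```python
import math


def num_ranked_lists(n_items: int, k_slots: int) -> int:
    """Number of ordered selections of k_slots items out of n_items."""
    return math.perm(n_items, k_slots)  # N! / (N - K)!, Python 3.8+


# (N, K) pairs taken from the Figure 1 caption.
settings = [(7, 5), (10, 5), (5, 5)]
sizes = {nk: num_ranked_lists(*nk) for nk in settings}

for (n, k), size in sizes.items():
    # n * k is the number of (position, item) pairs in the bipartite graph.
    print(f"N={n}, K={k}: {size} ranked lists vs {n * k} position-item pairs")
```

Even in these small settings the number of ranked lists runs from 120 to 30,240, while the number of position-item pairs stays at 50 or fewer.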

Theorems & Definitions (16)

Remark 2.1
Example 2.3: Watchtime
Example 2.4: Revenue
Example 2.5: Click-Through-Rate
Remark 3.1
Theorem 4.1
Corollary 4.2
Proposition 4.3
Lemma 4.4
Lemma 3.1
...and 6 more
