Adaptively Learning to Select-Rank in Online Platforms

Jingyuan Wang, Perry Dong, Ying Jin, Ruohan Zhan, Zhengyuan Zhou

TL;DR

This research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience, and develops a user response model that considers diverse user preferences and the varying effects of item positions.

Abstract

Ranking algorithms are fundamental to various online platforms, from e-commerce sites to content streaming services. Our research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience. We develop a user response model that considers diverse user preferences and the varying effects of item positions, aiming to optimize overall user satisfaction with the ranked list. We frame this problem within a contextual bandits framework, with each ranked list as an action. Our approach incorporates an upper confidence bound to adjust predicted user satisfaction scores and selects the ranking action that maximizes these adjusted scores, a problem that is solved efficiently via maximum weight imperfect matching. We demonstrate that our algorithm achieves a cumulative regret bound of $O(d\sqrt{NKT})$ for ranking $K$ out of $N$ items in a $d$-dimensional context space over $T$ rounds, under the assumption that user responses follow a generalized linear model. This regret bound alleviates the dependence on the ambient action space, whose cardinality grows exponentially with $N$ and $K$ (thus rendering direct application of existing adaptive learning algorithms -- such as UCB or Thompson sampling -- infeasible). Experiments conducted on both simulated and real-world datasets demonstrate that our algorithm outperforms the baseline.
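The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the optimistic score of a ranked list decomposes into a sum of per-(item, position) scores, and the names `select_ranking`, `brute_force_ranking`, `predicted`, `bonus`, and `alpha` are hypothetical. Under that assumption, choosing $K$ of $N$ items and ordering them is a rectangular assignment problem, which SciPy's `linear_sum_assignment` solves without enumerating every ranked list.

```python
import itertools
import math

import numpy as np
from scipy.optimize import linear_sum_assignment


def select_ranking(ucb_scores):
    """Return K item indices (ordered by position) maximizing the total score.

    ucb_scores: (N, K) array; entry [i, k] is an assumed optimistic score
    for showing item i at position k.
    """
    ucb_scores = np.asarray(ucb_scores, dtype=float)
    # Rectangular assignment: every position receives exactly one item,
    # and the remaining N - K items stay unmatched.
    items, positions = linear_sum_assignment(ucb_scores, maximize=True)
    ranking = np.empty(ucb_scores.shape[1], dtype=int)
    ranking[positions] = items
    return ranking.tolist()


def brute_force_ranking(ucb_scores):
    """Enumerate every ordered K-subset of items; only viable for tiny N, K."""
    n, k = ucb_scores.shape
    best = max(
        itertools.permutations(range(n), k),
        key=lambda r: sum(ucb_scores[i, pos] for pos, i in enumerate(r)),
    )
    return list(best)


def total_score(ucb_scores, ranking):
    """Sum of the scores of the chosen (item, position) pairs."""
    return float(sum(ucb_scores[i, pos] for pos, i in enumerate(ranking)))


rng = np.random.default_rng(0)
N, K = 7, 5

# Hypothetical inputs: a predicted score and an exploration bonus per pair.
predicted = rng.random((N, K))
bonus = rng.random((N, K))
alpha = 0.5  # assumed exploration weight, not a value from the paper
ucb = predicted + alpha * bonus

ranking = select_ranking(ucb)
reference = brute_force_ranking(ucb)

# Number of distinct ranked lists of K out of N items: N! / (N - K)!
num_actions = math.perm(N, K)
```

For $N=7$ and $K=5$ the brute-force search already visits $7!/2! = 2520$ ranked lists, while the matching step works directly on the $7\times 5$ score matrix; this gap is what makes treating each ranked list as a separate arm impractical for larger $N$ and $K$.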

Paper Structure

This paper contains 27 sections, 11 theorems, 110 equations, 3 figures, 1 table, 5 algorithms.

Key Result

Theorem 4.1

Fix any $\delta\in(0,1)$, and let $c_1:=\min\{\frac{1}{2K},c_x\}>0$. Suppose Assumptions assump:context and assump:regularity hold, and $T_0 \geq \max\left\{ \left(\frac{16}{3c_1}+\frac{32(K+N)^2}{N^2c_1}\right)\log\frac{2(d+1)}{\delta},~\frac{6\bar{\sigma}^2}{c_1\kappa^2}\left((d+1)\log(1+2T/d)+\log(1/\delta)\right)\right\}$, where $\bar{c}=\max_{k\in[K]}c_k$. With the proper choice of the initialization phase $T_0$,

Figures (3)

  • Figure 1: Average cumulative regret (with standard deviation interval) of UCR and G-MLE in the simulated environment. Left: $N=7,K=5$; middle: $N=10,K=5$; right: $N=K=5$.
  • Figure 2: Average relative regret (with standard deviation interval) of UCR and G-MLE on the real-world dataset.
  • Figure 3: Visualization of the bipartite matching in UCR and in the greedy MLE (G-MLE) approach.

Theorems & Definitions (16)

  • Remark 2.1
  • Example 2.3: Watchtime
  • Example 2.4: Revenue
  • Example 2.5: Click-Through-Rate
  • Remark 3.1
  • Theorem 4.1
  • Corollary 4.2
  • Proposition 4.3
  • Lemma 4.4
  • Lemma 3.1
  • ...and 6 more