Online Learning and Equilibrium Computation with Ranking Feedback

Mingyang Liu; Yongshan Chen; Zhiyuan Fan; Gabriele Farina; Asuman Ozdaglar; Kaiqing Zhang

Abstract

Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
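The two objects the abstract relies on, Plackett-Luce ranking feedback with a temperature and external regret, can be illustrated with a minimal sketch. It assumes the common softmax-style parameterization in which an action with utility $u_i$ receives weight $\exp(u_i/\tau)$; the paper's exact parameterization is not shown in this excerpt, so treat that as an assumption. The function names `sample_plackett_luce` and `external_regret` are hypothetical and do not come from the paper.

```python
import math
import random


def sample_plackett_luce(utilities, tau, rng=random):
    """Sample a ranking (best first) over action indices.

    Assumed model: actions are drawn one at a time without replacement,
    each with probability proportional to exp(utility / tau).
    A smaller tau makes the ranking closer to deterministic.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Subtract the current maximum for numerical stability.
        top = max(utilities[i] for i in remaining)
        weights = [math.exp((utilities[i] - top) / tau) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def external_regret(utility_seq, played):
    """Utility of the best fixed action in hindsight minus realized utility."""
    num_actions = len(utility_seq[0])
    best_fixed = max(
        sum(u[a] for u in utility_seq) for a in range(num_actions)
    )
    realized = sum(u[a] for u, a in zip(utility_seq, played))
    return best_fixed - realized


if __name__ == "__main__":
    # The learner sees only the ranking, never the numeric utilities.
    print(sample_plackett_luce([0.1, 0.9, 0.5], tau=0.1))
    print(external_regret([[1, 0], [0, 1], [1, 0]], played=[1, 1, 1]))
```

In this sketch the learner would receive only the list returned by `sample_plackett_luce`, while `external_regret` is computed by an evaluator who knows the hidden utilities.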

Paper Structure

This paper contains 48 sections, 24 theorems, 132 equations, 11 figures, 1 table, 3 algorithms.

Introduction
Contributions.
Related Work
Dueling Bandits.
Reinforcement Learning from Human Feedback (RLHF) and Preference-Based RL.
Learning of Stable Matchings.
Recent Work by *bandit-ranking-feedback.
Preliminaries
Online Learning
Online Learning Algorithms with Numeric Feedback
Online Learning with Ranking Feedback
Hardness Results
Online Learning with Instantaneous-Utility Ranking Feedback
Utility Estimation
Sublinear Regret with Instantaneous-Utility Ranking Feedback
...and 33 more sections

Key Result

theorem 1

Consider instantaneous-utility ranking feedback (item:rank-instant). For any $T>0$, temperature $0<\tau\leq 0.1$, and online learning algorithm, there exists a sequence of utilities $\left(\bm{u}^{(t)}\right)_{t=1}^T$ such that $\min\left\{\mathbb{E}\left[R^{(T), {\rm external}}\right], \mathbb{E}\left[R^{(T)}\right]\right\}\geq \Omega\lef

Figures (11)

Figure 1: Two examples of online learning and equilibrium computation with ranking feedback. In (a), an online platform recommends food options to a customer at each timestep and receives a ranking over the proposed items, which it uses to improve recommendation quality. In (b), an online dating app recommends potential matches; users rank the suggested candidates, and the platform leverages these rankings to learn matching equilibria over time.
Figure 2: Regret of the FTRL-based algorithm (alg: FTRL) with time-average utility ranking feedback under bandit feedback, for different temperatures $\tau$ and numbers of proposed actions $K$, in the online learning setting.
Figure 3: Exploitability in the full-information feedback setting under both instantaneous-utility and time-average utility ranking feedback. Performance is evaluated across different temperatures $\tau$ and cumulative utility variations $P^{(T)} = T^q$. Each parameter combination is tested $10$ times with different random seeds.
Figure 4: Regret in the bandit feedback setting under instantaneous-utility ranking feedback, in the online learning setting. Performance is evaluated across different temperatures $\tau$, cumulative utility variations $P^{(T)} = T^q$, and numbers of proposed actions $K$. Each parameter combination is tested $10$ times with different random seeds.
Figure 5: Regret in the bandit feedback setting under time-average utility ranking feedback, in the online learning setting. Performance is evaluated across different temperatures $\tau$, cumulative utility variations $P^{(T)} = T^q$, and numbers of proposed actions $K$. Each parameter combination is tested $10$ times with different random seeds.
...and 6 more figures

Theorems & Definitions (29)

theorem 1
theorem 2
theorem 3
theorem 4
theorem 5
theorem 6
theorem 7
theorem 8
theorem 9
definition 1: $\epsilon$-CCE
...and 19 more
