Misalignment, Learning, and Ranking: Harnessing Users Limited Attention

Arpit Agarwal; Rad Niazadeh; Prathamesh Patil

TL;DR

The paper studies learning-to-rank when users' choices are misaligned with the platform's payoffs and users have limited attention. It develops two settings: (i) stochastic item payoffs with adversarial attention windows, where an active-elimination algorithm attains an optimal instance-dependent $O(\log T)$ regret, with a matching lower bound for stationary utilities; and (ii) adversarial payoffs with stochastic windows, where a computationally efficient algorithm attains $O(\sqrt{nT})$ regret by reducing the problem to online linear optimization over admissible marginal selection probabilities and rounding the resulting fractional solutions to permutations. A key contribution is a polynomial-size representation of the feasible space of item selection probabilities, together with a rounding algorithm (RFSM) that decomposes fractional policies into permutations. The results show that, by exploiting the combinatorial structure of rankings and limited attention, the platform can attain near-optimal learning performance without explicit incentive mechanisms, with practical extensions to unknown utilities and delayed feedback. Overall, the work connects learning-to-rank, bandit theory, and combinatorial optimization to address misalignment in real-world ranking systems.

Abstract

In digital health and EdTech, recommendation systems face a significant challenge: users often choose impulsively, in ways that conflict with the platform's long-term payoffs. This misalignment makes it difficult to effectively learn to rank items, as it may hinder exploration of items with greater long-term payoffs. Our paper tackles this issue by utilizing users' limited attention spans. We propose a model where a platform presents items with unknown payoffs to the platform in a ranked list to $T$ users over time. Each user selects an item by first considering a prefix window of these ranked items and then picking the highest preferred item in that window (and the platform observes its payoff for this item). We study the design of online bandit algorithms that obtain vanishing regret against hindsight optimal benchmarks. We first consider adversarial window sizes and stochastic iid payoffs. We design an active-elimination-based algorithm that achieves an optimal instance-dependent regret bound of $O(\log(T))$, by showing matching regret upper and lower bounds. The key idea is using the combinatorial structure of the problem to either obtain a large payoff from each item or to explore by getting a sample from that item. This method systematically narrows down the item choices to enhance learning efficiency and payoff. Second, we consider adversarial payoffs and stochastic iid window sizes. We start from the full-information problem of finding the permutation that maximizes the expected payoff. By a novel combinatorial argument, we characterize the polytope of admissible item selection probabilities by a permutation and show it has a polynomial-size representation. Using this representation, we show how standard algorithms for adversarial online linear optimization in the space of admissible probabilities can be used to obtain a polynomial-time algorithm with $O(\sqrt{T})$ regret.
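
The selection model described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm. It only simulates how a user with a prefix window picks an item, and how a ranking together with a window-size distribution induces item selection probabilities and an expected platform payoff. The function names, the dictionary-based inputs, the fixed utility ordering, and the example numbers are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def select_item(ranking: Sequence[int], window: int,
                utility: Dict[int, float]) -> int:
    """Item chosen by a user who inspects the first `window` ranked items
    and picks the one with the highest utility (hypothetical helper)."""
    if not 1 <= window <= len(ranking):
        raise ValueError("window must be between 1 and len(ranking)")
    prefix = ranking[:window]
    return max(prefix, key=lambda item: utility[item])


def selection_probabilities(ranking: Sequence[int],
                            window_dist: Dict[int, float],
                            utility: Dict[int, float]) -> Dict[int, float]:
    """Probability that each item is selected when the window size is drawn
    from `window_dist` (assumed: one fixed utility ordering for all users)."""
    probs = {item: 0.0 for item in ranking}
    for window, p in window_dist.items():
        probs[select_item(ranking, window, utility)] += p
    return probs


def expected_payoff(ranking: Sequence[int],
                    window_dist: Dict[int, float],
                    utility: Dict[int, float],
                    payoff: Dict[int, float]) -> float:
    """Expected platform payoff of a ranking: sum over items of
    (selection probability) * (platform payoff of that item)."""
    probs = selection_probabilities(ranking, window_dist, utility)
    return sum(probs[item] * payoff[item] for item in ranking)


# Illustrative, made-up instance: users like item 0 most,
# while the platform earns the most from item 2.
utility = {0: 3.0, 1: 2.0, 2: 1.0}
payoff = {0: 0.1, 1: 0.5, 2: 0.9}
window_dist = {1: 0.5, 2: 0.3, 3: 0.2}

aligned_with_user = expected_payoff([0, 1, 2], window_dist, utility, payoff)
aligned_with_platform = expected_payoff([2, 1, 0], window_dist, utility, payoff)
print(aligned_with_user, aligned_with_platform)
```

In this toy instance, ranking the user's favorite item first means it is always selected, whereas placing high-payoff items first lets short attention windows steer selections toward them. This is the misalignment that the platform exploits through limited attention.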

Paper Structure (20 sections, 11 theorems, 26 equations, 3 algorithms)

Introduction
Further Related Work
Preliminaries
Stochastic Payoffs and Adversarial Attention Windows
Model and Notations
The Upper Bound
The Algorithm
Regret Analysis
The Lower Bound
Adversarial Payoffs and Stochastic Windows
Warm Up: Sub-optimal Regret by Inducing Exploration
Admissible Selection Probabilities
The Rounding Algorithm
The Algorithm
Conclusion
...and 5 more sections

Key Result

Theorem 3.1

Given items $[n]$ with Gaussian payoff distributions $\{\mathcal{N}(\mu_i,1)\}_{i \in [n]}$, let $\{w^t\}_{t>0}$ and $\{u^t_i\}_{i\in [n],t>0}$ be any arbitrary sequence of attention window lengths and utility orderings, respectively, possibly adversarial and adaptively chosen. Then with probability where for any pair $i,j\in [n]$ of items, $\Delta_{ij} = \mu_i - \mu_j$ is the gap in expected payo

Theorems & Definitions (17)

Theorem 3.1
Theorem 3.2
Lemma 3.3
proof
Lemma 3.4
proof
proof: Proof of Theorem \ref{thm:stochastic_main}
Theorem 3.5
Corollary 3.6
Theorem 4.1
...and 7 more
