Incentive-compatible Bandits: Importance Weighting No More

Julian Zimmert; Teodor V. Marinov

TL;DR

The paper addresses incentive-compatible online learning in adversarial bandits, where self-interested experts may misreport to improve their selection chances. It introduces loss-biasing and masking techniques to enable linear-update, incentive-compatible algorithms that achieve near-optimal $O(\sqrt{KT})$ regret, including a loss-sequence-only method without importance weighting and an algorithm with best-of-both-worlds guarantees. It also presents LB-Prod, a loss-estimator-free approach with provable regret bounds, and TS-Prod, a Tsallis-entropy-based method that attains adversarial $O(\sqrt{KT})$ regret and optimal stochastic performance, connecting to stabilized OMD via a perturbation framework. Collectively, the results show that incentive-compatible bandits can match standard bandits in regret performance while simplifying updates and enabling robust performance across stochastic and adversarial regimes, with practical implications for deploying incentive-aware online learning systems.

Abstract

We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms that are simultaneously incentive-compatible, that is, the experts are incentivised to report their true preferences, and have no regret with respect to the preferences of the best fixed expert in hindsight. Freeman et al. (2020) propose an algorithm with optimal $O(\sqrt{T \log(K)})$ regret in the full-information setting and $O(T^{2/3}(K\log(K))^{1/3})$ regret in the bandit setting. In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds. We further demonstrate how simple loss-biasing allows the algorithm proposed by Freeman et al. (2020) to enjoy $\tilde O(\sqrt{KT})$ regret. As a byproduct of our approach we obtain the first bandit algorithm with nearly optimal regret bounds in the adversarial setting that works entirely on the observed loss sequence, without the need for importance-weighted estimators. Finally, we provide an incentive-compatible algorithm that enjoys asymptotically optimal best-of-both-worlds regret guarantees, i.e., logarithmic regret in the stochastic regime as well as worst-case $O(\sqrt{KT})$ regret.
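For readers who want the regret notion in the abstract spelled out, one common way to write it is sketched below. The notation is an assumption made for illustration and is not taken from this summary: $\ell_{t,i} \in [0,1]$ denotes the loss of expert $i$ in round $t$, $I_t$ the expert selected by the learner, and $K$ and $T$ the number of experts and rounds.

$$
R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_{t,I_t}\right] \;-\; \min_{i \in [K]} \sum_{t=1}^{T} \ell_{t,i}
$$

Under this reading, "no regret" means $R_T = o(T)$. The bounds quoted above differ in how fast $R_T/T$ vanishes: $T^{2/3}(K\log K)^{1/3}$ gives an average regret of order $T^{-1/3}$, while $\sqrt{KT}$ gives order $T^{-1/2}$.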

Paper Structure (29 sections, 22 theorems, 94 equations, 1 table)

Introduction
Problem setting and related work
Adversarial bandits
Incentive-compatible online learning
Best of both worlds
Modifying WSU-UX for nearly optimal regret guarantees
Intuition on biasing the update and the Prod family of algorithms
Importance weighting free adversarial MAB with LB-Prod
Analysis of LB-Prod
Intuition of LB-Prod
Best of both worlds algorithms
TS-Prod
Analysis of TS-Prod
Reduction to $1/2$-Tsallis OMD
Discussion
...and 14 more sections

Key Result

Lemma 1

If $\eta K/\gamma \leq \frac{1}{2}$, the WSU-UX weights $\pi_t$ and $\tilde{\pi}_t$ are valid probability distributions for all $t\in[T]$.
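The condition in Lemma 1 can be checked numerically with a small simulation. The sketch below assumes the WSU-UX update has the form commonly attributed to Freeman et al. (2020): uniform exploration $\tilde{\pi}_t = (1-\gamma)\pi_t + \gamma/K$, an importance-weighted loss estimate for the pulled arm, and a linear update $\pi_{t+1,i} = \pi_{t,i}\,(1 - \eta(\hat{\ell}_{t,i} - \sum_j \pi_{t,j}\hat{\ell}_{t,j}))$. This summary does not state the update, so treat that form, the function names, and the parameter values as illustrative assumptions rather than the authors' implementation.

```python
import random


def wsu_ux_step(pi, losses, eta, gamma, rng):
    """One round of an assumed WSU-UX-style update (illustrative sketch)."""
    K = len(pi)
    # Mix the weights with the uniform distribution (uniform exploration).
    pi_tilde = [(1.0 - gamma) * p + gamma / K for p in pi]
    # Sample the arm to pull from the exploration-mixed distribution.
    arm = rng.choices(range(K), weights=pi_tilde, k=1)[0]
    # Importance-weighted estimate: non-zero only for the pulled arm,
    # and at most K / gamma because pi_tilde[arm] >= gamma / K.
    est = [0.0] * K
    est[arm] = losses[arm] / pi_tilde[arm]
    # Linear update around the pi-weighted average estimate.
    avg = sum(p * e for p, e in zip(pi, est))
    new_pi = [p * (1.0 - eta * (e - avg)) for p, e in zip(pi, est)]
    return new_pi, pi_tilde, arm


def run_wsu_ux(K=5, T=500, eta=0.005, gamma=0.1, seed=0):
    """Run T rounds on random losses and check the weights stay valid."""
    # Hypothetical parameters chosen so that eta * K / gamma = 0.25 <= 1/2.
    assert eta * K / gamma <= 0.5
    rng = random.Random(seed)
    pi = [1.0 / K] * K
    for _ in range(T):
        losses = [rng.random() for _ in range(K)]  # losses in [0, 1]
        pi, pi_tilde, _ = wsu_ux_step(pi, losses, eta, gamma, rng)
        # Both distributions should remain positive and sum to one.
        assert min(pi) > 0.0 and abs(sum(pi) - 1.0) < 1e-9
        assert min(pi_tilde) > 0.0 and abs(sum(pi_tilde) - 1.0) < 1e-9
    return pi


if __name__ == "__main__":
    print(run_wsu_ux())
```

Under the assumed update, the intuition is that the estimate never exceeds $K/\gamma$, so $\eta\hat{\ell}_{t,i} \leq 1/2$ and every multiplicative factor stays positive, while the $\pi_t$-weighted centering keeps the weights summing to one.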

Theorems & Definitions (39)

Lemma 1: Lemma 4.1 of Freeman et al. (2020)
Lemma 2: Lemma 4.3 of Freeman et al. (2020)
Theorem 1
proof
Theorem 2
Lemma 3
Lemma 4
proof
proof
Theorem 3
...and 29 more
