Incentive-compatible Bandits: Importance Weighting No More

Julian Zimmert, Teodor V. Marinov

TL;DR

The paper addresses incentive-compatible online learning in adversarial bandits, where self-interested experts may misreport to improve their selection chances. It introduces loss-biasing and masking techniques to enable linear-update, incentive-compatible algorithms that achieve near-optimal $O(\sqrt{KT})$ regret, including a loss-sequence-only method without importance weighting and an algorithm with best-of-both-worlds guarantees. It also presents LB-Prod, a loss-estimator-free approach with provable regret bounds, and TS-Prod, a Tsallis-entropy-based method that attains adversarial $O(\sqrt{KT})$ regret and optimal stochastic performance, connecting to stabilized OMD via a perturbation framework. Collectively, the results show that incentive-compatible bandits can match standard bandits in regret performance while simplifying updates and enabling robust performance across stochastic and adversarial regimes, with practical implications for deploying incentive-aware online learning systems.

Abstract

We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms which are simultaneously incentive-compatible, that is, the experts are incentivised to report their true preferences, and have no regret with respect to the preferences of the best fixed expert in hindsight. Freeman et al. (2020) propose an algorithm with optimal $O(\sqrt{T \log(K)})$ regret in the full-information setting and $O(T^{2/3}(K\log(K))^{1/3})$ regret in the bandit setting. In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds. We further demonstrate how simple loss-biasing allows the algorithm proposed by Freeman et al. (2020) to enjoy $\tilde O(\sqrt{KT})$ regret. As a byproduct of our approach we obtain the first bandit algorithm with nearly optimal regret bounds in the adversarial setting which works entirely on the observed loss sequence without the need for importance-weighted estimators. Finally, we provide an incentive-compatible algorithm that enjoys asymptotically optimal best-of-both-worlds regret guarantees, i.e., logarithmic regret in the stochastic regime as well as worst-case $O(\sqrt{KT})$ regret.

Paper Structure

This paper contains 29 sections, 22 theorems, 94 equations, 1 table.

Key Result

Lemma 1

If $\eta K/\gamma \leq \frac{1}{2}$, the WSU-UX weights $\pi_t$ and $\tilde{\pi}_t$ are valid probability distributions for all $t\in[T]$.
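The lemma can be checked numerically. The sketch below assumes that WSU-UX mixes the weights with the uniform distribution, $\tilde{\pi}_t = (1-\gamma)\pi_t + \gamma/K$, forms the importance-weighted estimate $\hat{\ell}_{t,i} = \ell_{t,i}\mathbb{1}[I_t = i]/\tilde{\pi}_{t,i}$ with losses in $[0,1]$, and updates $\pi_{t+1,i} = \pi_{t,i}\bigl(1 - \eta(\hat{\ell}_{t,i} - \sum_j \pi_{t,j}\hat{\ell}_{t,j})\bigr)$. This form of the update is not stated in the text above and is an assumption, as are the function names `wsu_ux_step` and `run`, the uniformly random losses, and the parameter values.

```python
import numpy as np


def wsu_ux_step(pi, losses, eta, gamma, rng):
    """One round of a WSU-UX-style update (assumed form, see the lead-in)."""
    K = len(pi)
    # Mix with the uniform distribution so every arm has probability >= gamma / K.
    pi_tilde = (1.0 - gamma) * pi + gamma / K
    arm = rng.choice(K, p=pi_tilde)
    # Importance-weighted loss estimate: nonzero only for the pulled arm.
    loss_hat = np.zeros(K)
    loss_hat[arm] = losses[arm] / pi_tilde[arm]
    # Linear update; subtracting the pi-weighted average keeps the sum at 1.
    pi_next = pi * (1.0 - eta * (loss_hat - pi @ loss_hat))
    return pi_next, pi_tilde, arm


def run(K=5, T=2000, gamma=0.1, seed=0):
    """Run T rounds on random losses and check validity of both weight vectors."""
    eta = gamma / (2 * K)  # chosen so that eta * K / gamma = 1/2
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    for _ in range(T):
        losses = rng.uniform(0.0, 1.0, size=K)  # hypothetical loss sequence
        pi, pi_tilde, _ = wsu_ux_step(pi, losses, eta, gamma, rng)
        # Both vectors should be nonnegative and sum to one (up to rounding).
        assert np.all(pi >= 0.0) and abs(pi.sum() - 1.0) < 1e-9
        assert np.all(pi_tilde >= 0.0) and abs(pi_tilde.sum() - 1.0) < 1e-9
    return pi


if __name__ == "__main__":
    print(run())
```

Under these assumptions the condition is what keeps the update safe: since $\tilde{\pi}_{t,i} \geq \gamma/K$, the estimate is at most $K/\gamma$, so $\eta\hat{\ell}_{t,i} \leq \frac{1}{2}$ and each multiplicative factor is at least $\frac{1}{2}$, while the centering term leaves the total mass unchanged.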

Theorems & Definitions (39)

  • Lemma 1: Lemma 4.1 of Freeman et al. (2020)
  • Lemma 2: Lemma 4.3 of Freeman et al. (2020)
  • Theorem 1
  • proof
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • proof
  • proof
  • Theorem 3
  • ...and 29 more