Learning for Bandits under Action Erasures

Osama Hanna, Merve Karakas, Lin F. Yang, Christina Fragouli

TL;DR

This work introduces a novel multi-armed bandit (MAB) setting with action erasures, in which the distributed agents observe erasures but the central learner does not. It proposes a generic Repeat-the-Instruction wrapper that can be placed on top of any MAB algorithm to make it robust to erasures, achieving a worst-case regret within a factor $O(1/\sqrt{1-\epsilon})$ of the no-erasure baseline, and demonstrates a practical bound when paired with UCB. It also develops Lingering SAE (L-SAE), a variant of successive arm elimination designed to tolerate erasures, with a worst-case regret of $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$ and a lower bound of $\Omega(K/(1-\epsilon))$ that matches up to logarithmic factors. Together, these results show that robust learning over erasure channels is achievable with minimal modifications to existing MAB algorithms, and they quantify the trade-offs between the horizon $T$, the number of arms $K$, and the erasure probability $\epsilon$. The findings have potential implications for distributed robotics and communication-constrained decision systems where reliable action transmission cannot be guaranteed.
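To see when the erasure term in the L-SAE bound matters, it may help to compare the two terms directly. The short derivation below ignores constants and logarithmic factors, so it is only an order-of-magnitude reading of $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, not a statement from the paper.

```latex
\[
\sqrt{KT} \;\ge\; \frac{K}{1-\epsilon}
\quad\Longleftrightarrow\quad
KT \;\ge\; \frac{K^2}{(1-\epsilon)^2}
\quad\Longleftrightarrow\quad
T \;\ge\; \frac{K}{(1-\epsilon)^2}.
\]
% Illustrative numbers (chosen here, not taken from the paper):
% K = 10, \epsilon = 0.9 gives K/(1-\epsilon)^2 = 10/0.01 = 1000.
% For T = 10^6: \sqrt{KT} = \sqrt{10^7} \approx 3162, while K/(1-\epsilon) = 100.
```

Under this reading, once the horizon exceeds roughly $K/(1-\epsilon)^2$, the $\sqrt{KT}$ term dominates and the erasure term only adds a lower-order cost.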

Abstract

We consider a novel multi-arm bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, while the distributed agents know if an action is erased, the central learner does not (there is no feedback), and thus does not know whether the observed reward resulted from the desired action or not. We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over action-erasure channels that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, which we prove is optimal by providing a matching lower bound.
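The channel model described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the function name `simulate_erasure_channel` and its parameters are hypothetical, and it assumes that an agent whose instruction is erased keeps playing the last action it successfully received, which the text above does not spell out.

```python
import random


def simulate_erasure_channel(sent_actions, eps, initial_action=0, seed=0):
    """Return the actions the agent actually plays over an erasure channel.

    Assumption (not stated in the abstract): on an erasure, the agent
    repeats the most recent action it received; before any instruction
    gets through, it plays `initial_action`.
    """
    rng = random.Random(seed)
    played = []
    current = initial_action
    for action in sent_actions:
        # Each instruction is delivered independently with probability 1 - eps.
        if rng.random() >= eps:
            current = action
        played.append(current)
    return played


if __name__ == "__main__":
    sent = [1, 2, 3, 1, 2]
    # The learner knows `sent` and the rewards, but never sees `played`.
    print(simulate_erasure_channel(sent, eps=0.5))
```

The point of the sketch is the information gap: the learner observes a reward every round but cannot tell whether it came from the action it sent or from a stale one.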

Paper Structure

This paper contains 8 sections, 5 theorems, 22 equations, 1 table, 2 algorithms.

Key Result

Theorem 1

Let ALG be a MAB algorithm with expected regret upper bounded by $R^{\text{ALG}}_T(\{\Delta_i\}_{i=1}^K)$ for any instance with gaps $\{\Delta_i\}_{i=1}^K$. For $\alpha = \lceil 2 \log T / \log(\frac{1}{\epsilon})\rceil$, using Repeat-the-Instruction on top of ALG achieves an expected regret $\dots$, where the expectation is over the randomness in the MAB instance, erasures, and algorithm.
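The choice of $\alpha$ in Theorem 1 can be checked numerically: if erasures are independent with probability $\epsilon$, then all $\alpha$ copies of one instruction are lost with probability $\epsilon^{\alpha} \le T^{-2}$. The sketch below computes $\alpha$ and wraps a generic algorithm. The wrapper is an assumption about how Repeat-the-Instruction operates (each chosen action is sent $\alpha$ times and only the last reward is passed to ALG); the callbacks `select_action`, `update`, and `pull` are hypothetical names introduced for illustration.

```python
import math


def repetitions(T, eps):
    """alpha = ceil(2 log T / log(1/eps)) from Theorem 1 (needs 0 < eps < 1, T >= 2)."""
    return math.ceil(2 * math.log(T) / math.log(1 / eps))


def all_erased_probability(T, eps):
    """Probability that all alpha copies of one instruction are erased."""
    return eps ** repetitions(T, eps)


def repeat_the_instruction(select_action, update, pull, T, eps):
    """Sketch of a repetition wrapper around a generic MAB algorithm.

    Assumption: the learner sends each chosen action alpha times in a row
    and feeds only the reward of the last repetition back to the algorithm.
    `pull(a)` sends action `a` over the channel and returns the observed reward.
    """
    alpha = repetitions(T, eps)
    for _ in range(T // alpha):
        action = select_action()
        reward = None
        for _ in range(alpha):
            reward = pull(action)
        # Only the final reward is credited to `action`.
        update(action, reward)


if __name__ == "__main__":
    T, eps = 1000, 0.5
    print(repetitions(T, eps), all_erased_probability(T, eps), 1 / T**2)
```

With $T = 1000$ and $\epsilon = 0.5$ this gives $\alpha = 20$, so the underlying algorithm effectively runs for $T/\alpha = 50$ rounds; the cost of robustness in this sketch is the shortened effective horizon.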

Theorems & Definitions (5)

  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • Theorem 3