Learning for Bandits under Action Erasures
Osama Hanna, Merve Karakas, Lin F. Yang, Christina Fragouli
TL;DR
This work introduces a novel multi-armed bandit (MAB) setting with action erasures, in which the distributed agents observe erasures but the central learner does not. It proposes a generic Repeat-the-Instruction wrapper that can augment any MAB algorithm to be robust to erasures, achieving a worst-case regret within a factor $O(1/\sqrt{1-\epsilon})$ of the no-erasure baseline, and demonstrates a practical bound when paired with UCB. Additionally, it develops Lingering SAE (L-SAE), a variant of successive arm elimination designed to tolerate erasures, with a regret of $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$ and a matching lower bound $\Omega(K/(1-\epsilon))$ up to logarithmic factors. Together, these results show that robust learning over erasure channels is achievable with minimal modifications to existing MAB algorithms, and they quantify the fundamental trade-offs between the horizon $T$, the number of arms $K$, and the erasure probability $\epsilon$. The findings have potential implications for distributed robotics and communication-constrained decision systems where reliable action transmission cannot be guaranteed.
Abstract
We consider a novel multi-armed bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, while the distributed agents know if an action is erased, the central learner does not (there is no feedback), and thus does not know whether the observed reward resulted from the desired action or not. We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over action-erasure channels that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, which we prove is optimal by providing a matching lower bound.
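To make the intuition behind repeating an instruction concrete, the sketch below is an illustration only, not the paper's algorithm. It assumes erasures are independent across transmissions, each with probability $\epsilon$, so that $\alpha$ copies of the same instruction are all erased with probability $\epsilon^{\alpha}$. The function names `repetitions_needed` and `delivery_rate`, and the target failure probability `delta`, are hypothetical choices made for this example.

```python
import random


def repetitions_needed(eps: float, delta: float) -> int:
    """Smallest alpha >= 1 such that eps**alpha <= delta.

    Assumption: erasures are i.i.d. with probability eps, so all alpha
    copies of one instruction are lost with probability eps**alpha.
    """
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must lie in [0, 1)")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    alpha = 1
    while eps ** alpha > delta:
        alpha += 1
    return alpha


def delivery_rate(eps: float, alpha: int, blocks: int, seed: int = 0) -> float:
    """Empirical fraction of blocks in which at least one copy gets through.

    Each block sends the same instruction alpha times over the simulated
    erasure channel; a copy is erased when the random draw falls below eps.
    """
    rng = random.Random(seed)
    delivered = 0
    for _ in range(blocks):
        if any(rng.random() >= eps for _ in range(alpha)):
            delivered += 1
    return delivered / blocks


if __name__ == "__main__":
    eps = 0.9
    alpha = repetitions_needed(eps, delta=0.01)
    print("copies per instruction:", alpha)
    print("empirical delivery rate:", delivery_rate(eps, alpha, blocks=20000))
```

For instance, `repetitions_needed(0.5, 0.25)` returns 2, since two copies are both lost with probability 0.25. Under this simplified model, the number of copies grows as $\epsilon$ approaches 1, which is consistent with the regret bounds above degrading as $1-\epsilon$ shrinks.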
