
An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability

Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff

Abstract

We study the stochastic multi-armed bandit (MAB) problem where an underlying network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, such that selecting an action reveals observations for all the unknowns it is connected to. While previous works rely on the assumption that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, where the set of feasible actions (the "activation set") varies dynamically in each round. This framework models real-world systems with both structural dependencies and volatility, such as social networks where users provide side-information about their peers' preferences, yet are not always online to be queried. To address this challenge, we propose UCB-LP-A, a novel policy that leverages a Linear Programming (LP) approach to optimize exploration-exploitation trade-offs under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy, characterizing the impact of both the network structure and the activation probabilities. Finally, we demonstrate through numerical simulations that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.
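The abstract describes computing a sampling distribution over actions so that every unknown is still observed through side-observations. As a rough illustration only (the paper's actual LP is not reproduced here), the core idea can be sketched as a fractional-cover linear program over the bipartite action-to-unknown graph: choose nonnegative sampling weights that observe each unknown at least once in expectation, at minimum total sampling cost. The matrix `B` and the omission of the activation probabilities $p_a$ are simplifying assumptions for this sketch, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical sketch of a fractional-cover LP. B[j, i] = 1 if
# pulling action j yields a side-observation of unknown i.
# This simplified form ignores the activation probabilities p_a
# that UCB-LP-A accounts for under stochastic availability.
B = np.array([
    [1, 1, 0],   # action 0 observes unknowns 0 and 1
    [0, 1, 1],   # action 1 observes unknowns 1 and 2
    [1, 0, 1],   # action 2 observes unknowns 0 and 2
])
K, N = B.shape

# minimize sum_j z_j  subject to  B^T z >= 1 (elementwise), z >= 0
res = linprog(c=np.ones(K),
              A_ub=-B.T, b_ub=-np.ones(N),
              bounds=[(0, None)] * K)
z_star = res.x
print("sampling weights:", z_star, "total rate:", z_star.sum())
```

By symmetry the optimum here puts weight 0.5 on each action (total rate 1.5): every unknown is covered by exactly two actions, so no cheaper fractional cover exists.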

Paper Structure

This paper contains 27 sections, 6 theorems, 40 equations, 5 figures, and 2 algorithms.

Key Result

Theorem 1

For every action $j\in\mathcal{K}_a$, $\forall a\in[A]$, define the round $m_{j,a} := \min\{m \in \mathcal{M} : \tilde{\Delta}_m < \Delta_{j,a}/2\}$, the set $\mathcal{G}_{m,a} = \{j\in\mathcal{K}_a : m_{j,a} \ge m\}$, and $\bar{\mathcal{G}}_m = \{\mathcal{G}_{m,1}, \dots, \mathcal{G}_{m,A}\}$, and $\bar{m} := \max \Bigl\{ m \i where $\gamma_{j,a} = \dfrac{p_a z^*_{j,a} Z^*_{\max}}{Z^*_a}$ with $Z^*_a = \sum_{j\in\

Figures (5)

  • Figure 1: Example of a social network with 2 activation sets and side information.
  • Figure 2: Bipartite graph for the example of targeting users in an online social network.
  • Figure 3: Regret comparison for a simulated social network generated using Barabási–Albert (BA) model with $m=3$.
  • Figure 4: Regret comparison for a 1000-user subgraph from the Facebook Dataset.
  • Figure 5: Example routing problem and regret comparison.

Theorems & Definitions (13)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • proof
  • proof
  • Lemma 1: Chernoff-Hoeffding Inequality
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 3 more