Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback

I-Hong Hou

TL;DR

The paper addresses learning optimal policies in multi-stage systems where only end-to-end bandit feedback is available and actions are distributed across the agents that manage each stage. It uses the Normalized Exponentiated Gradient approach for the complete-feedback case and proposes the $\epsilon$-EXP3 algorithm for end-to-end feedback, which addresses the exploration-exploitation-education trilemma. The theoretical results show sublinear regret scaling as $O\left(T^{\frac{L}{L+1}}\right)$ (with an anytime variant) and a lower bound of $\Omega\left(T^{\frac{L-1}{L}}\right)$ for a restricted class, which highlights the essential role of education. Empirical evaluations in mobile edge computing and multi-hop networks show the method's advantage over traditional no-regret policies and robust performance under end-to-end delays and varying depths.

Abstract

This paper studies multi-stage systems with end-to-end bandit feedback. In such systems, each job needs to go through multiple stages, each managed by a different agent, before generating an outcome. Each agent can control only its own action and learns only the final outcome of the job. It has neither knowledge of nor control over the actions taken by agents in the next stage. The goal of this paper is to develop distributed online learning algorithms that achieve sublinear regret in adversarial environments. The setting of this paper significantly expands the traditional multi-armed bandit problem, which considers only one agent and one stage. In addition to the exploration-exploitation dilemma in the traditional multi-armed bandit problem, we show that the consideration of multiple stages introduces a third component, education, where an agent needs to choose its actions to facilitate the learning of agents in the next stage. To solve this newly introduced exploration-exploitation-education trilemma, we propose a simple distributed online learning algorithm, $\epsilon$-EXP3. We theoretically prove that the $\epsilon$-EXP3 algorithm is a no-regret policy that achieves sublinear regret. Simulation results show that the $\epsilon$-EXP3 algorithm significantly outperforms existing no-regret online learning algorithms for the traditional multi-armed bandit problem.
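For orientation, here is a minimal sketch of single-agent EXP3 with uniform mixing, the one-agent, one-stage bandit setting that the paper extends. It is not the paper's $\epsilon$-EXP3: the multi-stage structure and the education component are not modeled. The function names, the step size `eta`, the mixing weight `eps`, and the toy cost vector are all hypothetical choices made for illustration.

```python
import math
import random

def exp3_mix_probs(log_w, eps):
    """Exponential-weights distribution mixed with the uniform distribution.

    Every arm keeps probability at least eps / k, so it keeps being sampled.
    """
    k = len(log_w)
    m = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - m) for lw in log_w]
    s = sum(w)
    return [(1.0 - eps) * x / s + eps / k for x in w]

def run_exp3_mix(cost_fn, k, horizon, eta, eps, seed=0):
    """Play `horizon` rounds with bandit feedback (only the chosen arm's cost).

    cost_fn(t, arm) must return a cost in [0, 1]. Hypothetical interface.
    """
    rng = random.Random(seed)
    log_w = [0.0] * k
    total_cost = 0.0
    for t in range(horizon):
        p = exp3_mix_probs(log_w, eps)
        arm = rng.choices(range(k), weights=p)[0]
        cost = cost_fn(t, arm)
        total_cost += cost
        # Importance-weighted estimate: unbiased for the unobserved cost vector.
        log_w[arm] -= eta * cost / p[arm]
    return exp3_mix_probs(log_w, eps), total_cost

# Toy instance (an assumption): arm 0 is cheap, the other two are expensive.
mean_costs = [0.1, 0.9, 0.9]
probs, total = run_exp3_mix(lambda t, a: mean_costs[a],
                            k=3, horizon=5000, eta=0.05, eps=0.1)
print(probs, total)
```

The uniform mixing term guarantees a floor of `eps / k` on every arm's probability, which bounds the importance weights `1 / p[arm]`. In this single-agent sketch that floor only serves exploration.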

Paper Structure (16 sections, 9 theorems, 27 equations, 7 figures, 3 algorithms)


Key Result

Lemma 1

If $y_n[j,\tau]\geq0$ for all $j\in \mathcal{C}_i$ and $\tau\in[1, T]$, then the expected total cost incurred by $i$ given $\mathcal{Y}_n[i]$ is upper-bounded by: Moreover, if $y_n[j,\tau]\in[0,1],\forall j\in \mathcal{C}_i, \tau\in[1,T]$, then setting $\eta_i=\sqrt{\frac{\log |\mathcal{C}_i|}{T}}$ yields: $\Box$
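As a rough illustration of the kind of guarantee Lemma 1 draws on, the sketch below runs the Normalized Exponentiated Gradient update with full feedback and checks its regret against $2\sqrt{T\log d}$. That bound is the standard textbook consequence of the cited Theorem 2.22 for costs in $[0,1]$ with step size $\sqrt{\log d / T}$; it is used here as an assumption and is not a quotation of the lemma's own displayed bound. The names `normalized_eg` and `best_fixed_cost`, the variable `d` (standing in for $|\mathcal{C}_i|$), and the random cost sequence are hypothetical.

```python
import math
import random

def normalized_eg(costs, eta):
    """Expected total cost of Normalized EG on a full-feedback cost sequence.

    costs[t][j] is the cost of choice j in round t; all choices are observed.
    """
    d = len(costs[0])
    log_w = [0.0] * d  # weights kept in log space for numerical stability
    total = 0.0
    for z in costs:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        s = sum(w)
        p = [x / s for x in w]  # current distribution over choices
        total += sum(pj * zj for pj, zj in zip(p, z))
        # Multiplicative update: each weight is scaled by exp(-eta * cost).
        log_w = [lw - eta * zj for lw, zj in zip(log_w, z)]
    return total

def best_fixed_cost(costs):
    """Total cost of the best single choice in hindsight."""
    d = len(costs[0])
    return min(sum(z[j] for z in costs) for j in range(d))

rng = random.Random(0)
T, d = 2000, 4
costs = [[rng.random() for _ in range(d)] for _ in range(T)]
eta = math.sqrt(math.log(d) / T)  # step size of the same form as in Lemma 1
regret = normalized_eg(costs, eta) - best_fixed_cost(costs)
bound = 2 * math.sqrt(T * math.log(d))
print(regret, bound)
```

The comparison `regret <= bound` should hold for any cost sequence in $[0,1]$, not just the random one drawn here, because the textbook bound is a worst-case statement.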

Figures (7)

  • Figure 1: A mobile edge computing system and its tree model
  • Figure 2: System illustration for establishing a lower bound
  • Figure 3: Time-average regrets under various system parameters
  • Figure 4: A system with $D=2, L=2$ and $p_{min}=0.2$
  • Figure 5: Transient behaviors of the system in Fig. 4 with $T=5\times10^6$.
  • ...and 2 more figures

Theorems & Definitions (25)

  • Definition 1
  • Lemma 1 (Shalev-Shwartz, 2012, Theorem 2.22)
  • Theorem 1
  • proof
  • Remark 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • ...and 15 more