Analyzing and Enhancing Queue Sampling for Energy-Efficient Remote Control of Bandits

Hiba Dakdouk, Mohamed Sana, Mattia Merluzzi

TL;DR

This paper studies remote control of multi-armed bandits where feedback travels through an unreliable Geo/Geo/1 queue, causing latency and potential losses. It introduces QR-MAB, an adaptation of standard bandit algorithms (e.g., UCB, TS) that updates only on observed feedback and is influenced by a queue-sampling policy. To mitigate detrimental effects like packet stacking, it proposes a stochastic biased sampling strategy that blends focused sampling with randomness, improving regret performance and energy efficiency. Numerical results show that the proposed sampling method can outperform existing queue-based approaches in regret, while delivering favorable energy-reward trade-offs, particularly for Thompson Sampling. The work highlights a practical trade-off between maximizing rewards and minimizing energy consumption and suggests future theoretical regret analysis and extensions to non-stationary environments and age-of-information considerations.

Abstract

In recent years, the integration of communication and control systems has gained significant traction in various domains, ranging from autonomous vehicles to industrial automation and beyond. Multi-armed bandit (MAB) algorithms have proven effective as a robust framework for solving control problems. In this work, we investigate the use of MAB algorithms to control remote devices, a setting whose main challenges are latency and reliability. We analyze the effectiveness of MABs operating in environments where the action feedback from controlled devices is transmitted over an unreliable communication channel and stored in a Geo/Geo/1 queue. We investigate the impact of queue sampling strategies on MAB performance and introduce a new stochastic approach. Its regret is evaluated against established algorithms in the literature, for both upper confidence bound (UCB) and Thompson Sampling (TS). Additionally, we study the trade-off between maximizing rewards and minimizing energy consumption.
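
To make the setting concrete, the following is a minimal sketch of a UCB learner whose feedback passes through a slotted queue before it can be used. It is an illustration under stated assumptions, not the authors' QR-MAB algorithm: the function name `run_queued_ucb`, the Bernoulli rewards, the UCB1 index, and the slot convention (feedback enters the queue with probability `lam`, otherwise it is lost; one queued item is delivered with probability `mu`) are all hypothetical choices made here. Only FIFO and LIFO sampling are shown.

```python
import math
import random
from collections import deque


def run_queued_ucb(means, lam, mu, horizon, lifo=False, seed=0):
    """Run UCB1 where statistics are updated only on feedback leaving a queue.

    Assumptions (not from the paper): Bernoulli arms with success
    probabilities `means`; each feedback joins the queue with probability
    `lam` and is lost otherwise; each slot, one queued feedback is delivered
    with probability `mu`.
    Returns (total reward collected, number of feedbacks observed).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # observed feedbacks per arm
    sums = [0.0] * k    # sum of observed rewards per arm
    queue = deque()     # pending (arm, reward) feedback
    total_reward = 0.0

    for _ in range(horizon):
        # Play arms that have no observed feedback yet, otherwise use UCB1.
        unobserved = [a for a in range(k) if counts[a] == 0]
        if unobserved:
            arm = rng.choice(unobserved)
        else:
            n = sum(counts)
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(n) / counts[a]),
            )

        reward = 1.0 if rng.random() < means[arm] else 0.0
        total_reward += reward

        # Unreliable channel: the feedback may never reach the queue.
        if rng.random() < lam:
            queue.append((arm, reward))

        # Queue sampling: FIFO takes the oldest item, LIFO the newest.
        if queue and rng.random() < mu:
            a, r = queue.pop() if lifo else queue.popleft()
            counts[a] += 1
            sums[a] += r

    return total_reward, sum(counts)
```

Calling `run_queued_ucb([0.2, 0.8], lam=0.6, mu=0.3, horizon=5000)` with `lifo=False` and then `lifo=True` gives a quick way to compare the two sampling orders on the same seed. When `lam` exceeds `mu`, the queue backlog grows, so under FIFO the learner works with increasingly stale feedback.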

Paper Structure (8 sections, 1 theorem, 4 equations, 8 figures, 1 algorithm)

Key Result

Proposition 3.1

Consider a Geo/Geo/1 queue of infinite length with arrival rate $\lambda$ and service rate $\mu$. The expected number of served packets at time $T$ is:
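
The quantity in Proposition 3.1 can also be checked numerically. The following is a minimal Monte Carlo sketch, not the paper's derivation. The helper names `simulate_served` and `mean_served` are hypothetical, and the slot convention (a packet arrives with probability `lam`, then the head packet is served with probability `mu` if the queue is non-empty) is an assumption that may differ from the paper's.

```python
import random


def simulate_served(lam, mu, horizon, seed=0):
    """Simulate one Geo/Geo/1 sample path and count packets served.

    Assumed convention (not from the paper): in each slot the arrival is
    drawn first, then a service attempt is made if the queue is non-empty.
    """
    rng = random.Random(seed)
    backlog = 0  # packets currently waiting (unbounded queue)
    served = 0   # packets that have left the queue so far
    for _ in range(horizon):
        if rng.random() < lam:
            backlog += 1
        if backlog > 0 and rng.random() < mu:
            backlog -= 1
            served += 1
    return served


def mean_served(lam, mu, horizon, runs=200):
    """Average the served-packet count over independent seeded runs."""
    total = sum(simulate_served(lam, mu, horizon, seed=s) for s in range(runs))
    return total / runs
```

For example, `mean_served(0.6, 0.3, 2000)` estimates the expected number of served packets for the parameters used in Figure 3. As a sanity check, the long-run throughput of such a queue is bounded by both rates, so the estimate divided by the horizon should approach $\min(\lambda, \mu)$ for large horizons; this is a standard property and is not the proposition's closed-form expression.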

Figures (8)

  • Figure 1: System model of the remotely controlled system.
  • Figure 2: Total reward of the UCB algorithm under LIFO and FIFO sampling strategies.
  • Figure 3: UCB total rewards under the $\pi_{\delta u}$ sampling strategy for different values of $\alpha$ and $a$, with $\mu=0.3$ and $\lambda=0.6$.
  • Figure 4: UCB total rewards under the $\pi_{\delta u}$ sampling strategy with $a=1$ and $\mu=0.3$, for different values of $\alpha$ and $\lambda$.
  • Figure 5: Cumulative pseudo regret of UCB for different policies with $\lambda=0.8$, $\mu=0.6$.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Proposition 3.1
  • proof