FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations

Marie Siew, Shikhar Sharma, Zekai Li, Kun Guo, Chao Xu, Tania Lorido-Botran, Tony Q. S. Quek, Carlee Joe-Wong

TL;DR

FIRE targets resilience in edge computing migrations by addressing rare server failures through a digital-twin RL framework that tilts learning toward high-impact rare events. It introduces ImRE, an importance-sampling-based Q-learning method, extends it to the deep RL variants ImDQL and ImACRE for large-scale networks, and adds RiTA to accommodate heterogeneous risk tolerances. The approach provides theoretical guarantees (boundedness and convergence) and shows via trace-driven simulations that FIRE reduces failure-related costs compared with vanilla RL and greedy baselines, at times at the expense of normal-state performance. The framework applies to both individual and shared service profiles and can be extended to other networking problems with rare but consequential events.

Abstract

In edge computing, users' service profiles are migrated due to user mobility. Reinforcement learning (RL) frameworks have been proposed to do so, often trained on simulated data. However, existing RL frameworks overlook occasional server failures, which, although rare, impact latency-sensitive applications like autonomous driving and real-time obstacle detection. Because these failures (rare events) are not adequately represented in historical training data, they pose a challenge for data-driven RL algorithms. As it is impractical to adjust failure frequency in real-world applications for training, we introduce FIRE, a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. FIRE considers delay, migration, failure, and backup placement costs across individual and shared service profiles. We prove ImRE's boundedness and convergence to optimality. Next, we introduce novel deep Q-learning (ImDQL) and actor-critic (ImACRE) versions of our algorithm to enhance scalability. We extend our framework to accommodate users with varying risk tolerances. Through trace-driven experiments, we show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
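
The core idea of training with an inflated failure rate and correcting for it afterwards can be illustrated with a minimal importance-sampling sketch. This is not the paper's ImRE algorithm: it only estimates a one-step expected cost, and the function name, the failure probabilities, and the cost values are all hypothetical choices made for illustration. Failures are simulated at rate `eps_sim` instead of the true rate `eps_true`, and each sample is reweighted by the likelihood ratio between the two.

```python
import random


def importance_sampled_cost(eps_true, eps_sim, cost_normal, cost_failure,
                            n_samples, seed=0):
    """Estimate the expected cost under the true failure rate `eps_true`
    while simulating failures at the (larger) rate `eps_sim`.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # Likelihood ratios: true probability / simulated probability.
    w_failure = eps_true / eps_sim
    w_normal = (1.0 - eps_true) / (1.0 - eps_sim)

    total = 0.0
    for _ in range(n_samples):
        if rng.random() < eps_sim:          # failure drawn in the simulator
            total += w_failure * cost_failure
        else:                               # normal operation
            total += w_normal * cost_normal
    return total / n_samples


# Hypothetical numbers: failures are rare (0.1%) but 1000x more costly.
eps_true, cost_normal, cost_failure = 0.001, 1.0, 1000.0
exact = (1 - eps_true) * cost_normal + eps_true * cost_failure
estimate = importance_sampled_cost(eps_true, 0.5, cost_normal, cost_failure,
                                   n_samples=10_000)
print(f"exact={exact:.3f}  importance-sampled={estimate:.3f}")
```

With these numbers every weighted sample is close to the exact expectation, so the estimate is stable even though roughly half the simulated steps are failures. A simulator run at the true rate would see only about ten failures in 10,000 steps and give a much noisier estimate.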

Paper Structure (19 sections, 5 theorems, 47 equations, 8 figures, 3 tables, 4 algorithms)

Key Result

Proposition 1

The number of migration path possibilities grows at least as fast as $O(N^T)$.
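
The $N^T$ count can be checked by brute force on small instances. The sketch below assumes the simplest reading of the proposition, namely a single service profile that may sit on any of $N$ servers at each of $T$ time steps; the function name is hypothetical, and backup placement (which the framework also considers) would only increase the count.

```python
from itertools import product


def count_migration_paths(num_servers: int, horizon: int) -> int:
    """Enumerate every placement sequence of one profile over `horizon`
    time steps, where each step picks one of `num_servers` servers."""
    return sum(1 for _ in product(range(num_servers), repeat=horizon))


# The enumerated count matches N**T on a few small cases.
for n, t in [(2, 3), (3, 4), (4, 5)]:
    paths = count_migration_paths(n, t)
    assert paths == n ** t
    print(f"N={n}, T={t}: {paths} paths")
```

Even at $N = 4$ and $T = 5$ there are already 1024 paths, which is why exhaustive search over migration paths does not scale.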

Figures (8)

  • Figure 1: Users move across the network. Each user has an individual service profile with a backup, in case the server hosting the service profile fails. To evaluate system resilience and optimize failure-handling algorithms and strategies without incurring real-world costs, the setup is modeled within a digital twin.
  • Figure 2: Multiple users (who are mobile across the network) share a service profile (SP) for their task, e.g. a common game environment or neural network. For instance, the users in red share the SP in red, which has multiple copies in the network.
  • Figure 3: Algorithm Framework: Our importance sampling-based reinforcement learning algorithms learn offline in a digital twin setup, to avoid experiencing the real cost of rare events. The converged policy is applied to online scenarios where rare events occur at their true rate.
  • Figure 4: Convergence graphs: Here, FIRE-ImRE and FIRE-ImACRE are applied to a special case of the scenario (Section \ref{section:SysModel}) where every user has their own service profile (the single user case), and FIRE-ImDQL is applied to the scenario where users share service profiles (Section \ref{section:SysModel2}). All three variations of our algorithm converge.
  • Figure 5: Single user service migration scenario: Comparison of our importance sampling actor-critic algorithm, FIRE-ImACRE, with actor-critic versions of the baselines NIS, WBA and RES, in an online scenario. FIRE-ImACRE leads to lower costs on average and in rare failure states, but higher storage and delay costs in normal states.
  • ...and 3 more figures

Theorems & Definitions (10)

  • Definition 1
  • Proposition 1
  • proof
  • Theorem 2
  • proof
  • Corollary 3
  • Theorem 4
  • proof
  • Theorem 5
  • proof