Resilient and Reliable Cloud Network Control for Mission-Critical Latency-Sensitive Service Chains
Chin-Wei Huang, Jaime Llorca, Antonia M. Tulino, Andreas F. Molisch
TL;DR
The paper addresses how to guarantee both reliability and resilience for latency-sensitive service chains in cloud-edge networks under strict per-packet end-to-end deadlines. It formulates the multi-commodity least-cost resilient and reliable network control (MC-LC-ResRNC) problem and introduces the MC-ResRCNC algorithm, which combines Lyapunov drift-plus-penalty optimization with flow matching over layered service graphs to achieve reliable steady-state throughput and rapid post-outage recovery. The approach extends prior RCNC methods with resilience actions taken at outages and a virtual-capacity update mechanism. Numerical experiments on the Abilene topology demonstrate superior resilience and switching stability, albeit with higher cost under normal operation. The results offer practical guidance on outage handling and arrival-rate management to sustain timely throughput in mission-critical networks, with implications for distributed cloud, edge, and network-function-chaining deployments.
Abstract
The proliferation of mission-critical latency-sensitive services has intensified the demand for next-generation cloud-integrated networks that guarantee both reliable and resilient service delivery. Reliability imposes timely-throughput requirements, i.e., a percentage of packets that must be delivered within a prescribed per-packet deadline, whereas resilience relates to the network's ability to swiftly recover timely-throughput performance following an outage event, such as a node or link failure. Although recent studies have increasingly focused on designing reliable network control policies, a comprehensive solution that combines reliable and resilient network control has yet to be fully explored. This paper formulates the multi-commodity least-cost resilient and reliable network control (MC-LC-ResRNC) problem as a stochastic control problem with long-term and short-term timely-throughput constraints. We then present a solution, the Multi-Commodity Resilient and Reliable Cloud Network Control (MC-ResRCNC) algorithm, and show through numerical experiments that it jointly ensures reliability under normal conditions and resilience upon network failure.
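To make the two requirements concrete, the following minimal sketch computes a timely-throughput value from per-packet delays and counts how many measurement windows pass before that value recovers after an outage. It is an illustration only, not the paper's formal definition or the MC-ResRCNC algorithm: the function names, the use of `None` for dropped packets, the window-based measurement, and the target threshold are all assumptions introduced here.

```python
from typing import Optional, Sequence


def timely_throughput(delays: Sequence[Optional[float]], deadline: float) -> float:
    """Fraction of packets delivered within `deadline`.

    `delays` holds one end-to-end delay per packet; None marks a packet
    that was dropped or never delivered (an assumption of this sketch).
    """
    if not delays:
        raise ValueError("need at least one packet")
    on_time = sum(1 for d in delays if d is not None and d <= deadline)
    return on_time / len(delays)


def recovery_time(windowed: Sequence[float], outage_index: int,
                  target: float) -> Optional[int]:
    """Windows elapsed from the outage until timely throughput is back
    at or above `target`; None if it never recovers in the trace.

    `windowed` is a hypothetical series of per-window timely-throughput
    values, and `outage_index` is the window in which the outage occurs.
    """
    for k in range(outage_index, len(windowed)):
        if windowed[k] >= target:
            return k - outage_index
    return None


# Reliability: two of four packets meet a deadline of 3.0 time units.
ratio = timely_throughput([1.0, 2.5, None, 4.0], deadline=3.0)

# Resilience: the outage hits in window 2 and the 0.9 target is met
# again in window 4, so recovery takes two windows.
windows = recovery_time([0.95, 0.96, 0.40, 0.70, 0.92, 0.95],
                        outage_index=2, target=0.9)
```

In this reading, a long-term constraint bounds the average of such ratios over time, while a short-term constraint limits how long the ratio may stay below target after an outage.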
