Reactive Orchestration for Hierarchical Federated Learning Under a Communication Cost Budget
Ivan Čilić, Anna Lackinger, Pantelis Frangoudis, Ivana Podnar Žarko, Alireza Furutanpey, Ilir Murturi, Schahram Dustdar
TL;DR
This work addresses the deployment of Hierarchical Federated Learning (HFL) in the dynamic computing continuum (CC) under a communication cost budget. It proposes an adaptive HFL orchestration framework with a reconfiguration cost model and a Reactive Reconfiguration Validation Algorithm (RVA) that predicts the impact of runtime changes and reverts decisions when needed. The key contributions are (i) a two-level orchestration design that integrates service-specific decisions with a general-purpose orchestrator, (ii) a formalized reconfiguration cost model that separates change costs from post-change per-round costs, and (iii) RVA for runtime validation. Experiments on a realistic 13-node K3s cluster using CIFAR-10 show that RVA can improve model accuracy within the budget while reacting promptly to churn with low overhead, which supports its practical viability for CC deployments.
Abstract
Deploying a Hierarchical Federated Learning (HFL) pipeline across the computing continuum (CC) requires careful organization of participants into a hierarchical structure with intermediate aggregation nodes between FL clients and the global FL server. This is challenging to achieve due to (i) cost constraints, (ii) varying data distributions, and (iii) the volatile operating environment of the CC. In response to these challenges, we present a framework for the adaptive orchestration of HFL pipelines, designed to be reactive to client churn and infrastructure-level events, while balancing communication cost and ML model accuracy. Our mechanisms identify and react to events that cause HFL reconfiguration actions at runtime, building on multi-level monitoring information (model accuracy, resource availability, resource cost). Moreover, our framework introduces a generic methodology for estimating reconfiguration costs to continuously re-evaluate the quality of adaptation actions, while being extensible to optimize for various HFL performance criteria. By extending the Kubernetes ecosystem, our framework demonstrates the ability to react promptly and effectively to changes in the operating environment, making the best of the available communication cost budget and effectively balancing costs and ML performance at runtime.
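To make the budget trade-off concrete, the following is a minimal sketch of the arithmetic behind a reconfiguration decision, assuming a single scalar communication budget, a fixed cost per global round, and a one-time cost for applying a change. The function names, parameters, and the accuracy-based revert rule are hypothetical illustrations and are not taken from the framework or from RVA itself.

```python
def affordable_rounds(remaining_budget: float,
                      per_round_cost: float,
                      change_cost: float = 0.0) -> int:
    """Whole rounds that still fit in the budget after paying a change cost.

    All three quantities are assumed to share one cost unit (hypothetical).
    """
    if per_round_cost <= 0:
        raise ValueError("per_round_cost must be positive")
    usable = remaining_budget - change_cost
    if usable <= 0:
        return 0
    return int(usable // per_round_cost)


def should_revert(accuracy_before: float,
                  accuracy_after: float,
                  tolerance: float = 0.0) -> bool:
    """Illustrative rule: revert if accuracy dropped by more than `tolerance`.

    This is a stand-in for runtime validation, not the paper's algorithm.
    """
    return accuracy_after < accuracy_before - tolerance


# Keep the current configuration: 1000 budget units left, 50 per round.
rounds_if_kept = affordable_rounds(1000.0, 50.0)

# Reconfigure: pay 120 once, then 40 per round.
rounds_if_changed = affordable_rounds(1000.0, 40.0, change_cost=120.0)

print(rounds_if_kept, rounds_if_changed)
```

With these illustrative numbers, keeping the configuration allows 20 more rounds, while reconfiguring allows 22 despite the up-front cost. Whether the extra rounds are worth it still depends on the model accuracy observed after the change, which is why a separate keep-or-revert check is sketched.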
