Inference Load-Aware Orchestration for Hierarchical Federated Learning

Anna Lackinger; Pantelis A. Frangoudis; Ivan Čilić; Alireza Furutanpey; Ilir Murturi; Ivana Podnar Žarko; Schahram Dustdar

TL;DR

This work tackles the challenge of serving inference requests while training in hierarchical federated learning, particularly for continual learning in compute-constrained environments. It introduces HFLOP, an inference load-aware, ILP-based orchestration scheme that places aggregators and assigns clients to them so as to minimize communication costs while respecting edge inference capacity. In experiments on the METR-LA traffic dataset, HFLOP achieves substantially lower inference latency (down to about $9.89$ ms) and less cloud traffic than flat FL and baseline hierarchical setups, even under capacity asymmetries. The results show practical benefits for edge-cloud orchestration in continual learning scenarios and point toward scalable, low-latency, privacy-preserving inference in distributed systems. The code is open source to support reproducibility and further research.

Abstract

Hierarchical federated learning (HFL) designs introduce intermediate aggregator nodes between clients and the global federated learning server in order to reduce communication costs and distribute server load. One side effect is that machine learning model replication at scale comes "for free" as part of the HFL process: model replicas are hosted at the client end, at intermediate nodes, and at the global server level, and are readily available for serving inference requests. This creates opportunities for efficient model serving but simultaneously couples the training and serving processes and calls for their joint orchestration. This is particularly important for continual learning, where a model is served while it is (re)trained periodically, upon specific triggers, or continuously, over shared infrastructure spanning the computing continuum. Consequently, training and inference workloads can interfere with each other, with detrimental effects on performance. To address this issue, we propose an inference load-aware HFL orchestration scheme, which makes informed decisions on HFL configuration, considering knowledge about inference workloads and the respective processing capacity. Applying our scheme to a continual learning use case in the transportation domain, we demonstrate that by optimizing aggregator node placement and device-aggregator association, significant inference latency savings can be achieved while communication costs are drastically reduced compared to flat centralized federated learning.
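The paper's actual ILP formulation is not reproduced on this page. As a rough illustration of the kind of decision the abstract describes (choosing aggregator nodes and associating devices with them under inference capacity limits), the following toy sketch searches all assignments exhaustively. The function name, the cost terms (`link_cost`, `open_cost`) and the capacity model are assumptions made for illustration, not the authors' model, and the brute-force search stands in for an ILP solver.

```python
from itertools import product


def place_and_assign(client_load, capacity, link_cost, open_cost):
    """Brute-force a toy placement/assignment instance (illustrative only).

    All parameters are hypothetical stand-ins, not the paper's notation:
      client_load[i]  : inference load generated by client i
      capacity[j]     : inference capacity of candidate edge node j
      link_cost[i][j] : cost of attaching client i to node j
      open_cost[j]    : cost of node j talking to the global server, paid if used
    Returns (best_cost, assignment), where assignment[i] is the node of client i,
    or (inf, None) if no assignment respects the capacities.
    """
    n_clients, n_nodes = len(client_load), len(capacity)
    best_cost, best_assignment = float("inf"), None
    for assignment in product(range(n_nodes), repeat=n_clients):
        # Accumulate the inference load each node would have to absorb.
        load = [0.0] * n_nodes
        for i, j in enumerate(assignment):
            load[j] += client_load[i]
        if any(load[j] > capacity[j] for j in range(n_nodes)):
            continue  # violates an edge inference capacity limit
        cost = sum(link_cost[i][j] for i, j in enumerate(assignment))
        cost += sum(open_cost[j] for j in set(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


# Three clients with load 2 each; node 0 is cheap to reach, but each node
# can absorb a load of only 4, so one client must go to the costlier node 1.
cost, assignment = place_and_assign(
    client_load=[2, 2, 2],
    capacity=[4, 4],
    link_cost=[[1, 5], [1, 5], [1, 5]],
    open_cost=[3, 3],
)
```

Exhaustive search is only viable for a handful of clients and nodes. It is used here to keep the example free of solver dependencies.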

Paper Structure (25 sections, 1 equation, 9 figures)

Introduction
Related Work
Continuous federated learning
Joint training-inference optimization in federated learning
Applications to traffic flow prediction
System architecture
The inference-aware HFL orchestration problem
System model
Problem formulation
Performance considerations
Evaluation
Methodology and use case
Continuous learning performance
Continuous learning
Continuous federated learning
...and 10 more sections

Figures (9)

Figure 1: Architectural view of our hierarchical federated learning orchestration framework. Our framework supports the joint orchestration of distributed training and inference serving.
Figure 2: Execution times of deriving the optimal solution to HFLOP using a commercial solver (CPLEX). Mean execution times are reported with 95% confidence intervals.
Figure 3: Different FL clients are clustered and connected to their closest local server. An inference request is sent to the closest server, which processes the request on its model or sends it to a local server if it is busy training. The local server processes the request or, if it reaches its processing capacity, forwards it to a cloud server, which then answers the request.
Figure 4: Sensor distribution of the METR-LA dataset.
Figure 5: Clustered sensors using the METR-LA dataset.
...and 4 more figures
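
The request path described in the Figure 3 caption can be sketched as a simple fallback chain. This is a minimal sketch under stated assumptions: the function name, its parameters, and the use of a request count as the capacity measure are hypothetical and not taken from the paper's code.

```python
def route_request(nearest_is_training: bool, local_load: int, local_capacity: int) -> str:
    """Return which tier answers an inference request (illustrative only).

    nearest_is_training : whether the closest server is busy training
    local_load          : requests the local server currently handles (assumed unit)
    local_capacity      : most requests the local server can handle (assumed unit)
    """
    if not nearest_is_training:
        return "nearest"  # the closest server answers on its own model
    if local_load < local_capacity:
        return "local"    # the local server still has spare capacity
    return "cloud"        # the local server is saturated, so forward to the cloud


# Closest server is training and the local server is full: the request
# ends up at the cloud server.
tier = route_request(nearest_is_training=True, local_load=10, local_capacity=10)
```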
