On the Optimization of Model Aggregation for Federated Learning at the Network Edge
Mengyao Li, Noah Ploch, Sebastian Troia, Carlo Spatocco, Wolfgang Kellerer, Guido Maier
TL;DR
This work addresses the challenge of Federated Learning (FL) at the network edge within MEC-SD-WAN environments by developing online resource-management strategies for FL model aggregation. It introduces an edge-to-cloud aggregator overlay and two optimization approaches: an ILP formulation that minimizes cumulative weighted capacity, and a scalable heuristic, HFEL-MESH, that jointly places aggregators and routes model updates. The key performance metric is the training round failure rate (TRFR). A WatchEDGE-based discrete-event simulator validates the methods, showing that HFEL-MESH closely approaches ILP performance while substantially reducing cloud-link congestion and improving TRFR compared to the baseline HFEL. The results point to a practical, scalable way to balance local edge computation against cloud communication in dynamic edge networks.
Abstract
The rapid increase in connected devices has significantly intensified the computational and communication demands on modern telecommunication networks. To address these challenges, integrating advanced Machine Learning (ML) techniques like Federated Learning (FL) with emerging paradigms such as Multi-access Edge Computing (MEC) and Software-Defined Wide Area Networks (SD-WANs) is crucial. This paper introduces online resource management strategies specifically designed for FL model aggregation, utilizing intermediate aggregation at edge nodes. Our analysis highlights the benefits of incorporating edge aggregators to reduce network link congestion and maximize the potential of edge computing nodes. However, the risk of network congestion persists. To mitigate this, we propose a novel aggregation approach that deploys an aggregator overlay network. We present an Integer Linear Programming (ILP) model and a heuristic algorithm to optimize the routing within this overlay network. Our solution demonstrates improved adaptability to network resource utilization, significantly reducing FL training round failure rates by up to 15% while also alleviating cloud link congestion.
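
To illustrate why intermediate aggregation at edge nodes can relieve the cloud link, the following is a minimal sketch of two-level sample-weighted averaging. It is not the paper's ILP or heuristic: the topology, the node names (`edge_a`, `edge_b`), the toy model vectors, and the helper `weighted_average` are all assumptions made for illustration. The sketch only shows that aggregating at the edge first yields the same global model as aggregating everything in the cloud, while sending fewer updates over the cloud link.

```python
def weighted_average(updates):
    """Sample-weighted average of model vectors.

    updates: list of (weights, num_samples) pairs.
    Returns (averaged_weights, total_samples).
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    return avg, total


# Hypothetical topology: each edge aggregator serves some clients.
# Each client update is (model weights, number of local samples).
clients_by_edge = {
    "edge_a": [([1.0, 2.0], 10), ([3.0, 4.0], 30)],
    "edge_b": [([5.0, 6.0], 20)],
}

# Hierarchical: aggregate at each edge node, then once more in the cloud.
edge_models = [weighted_average(c) for c in clients_by_edge.values()]
hier_model, hier_total = weighted_average(edge_models)

# Flat baseline: every client uploads directly to the cloud.
all_clients = [u for c in clients_by_edge.values() for u in c]
flat_model, flat_total = weighted_average(all_clients)

# Number of model updates that traverse the cloud link in each scheme.
cloud_uploads_hier = len(clients_by_edge)
cloud_uploads_flat = len(all_clients)

print(hier_model, flat_model)
print(cloud_uploads_hier, cloud_uploads_flat)
```

In this toy setting the two global models agree up to floating-point rounding, and the hierarchical scheme sends one update per edge aggregator over the cloud link instead of one per client. Where to place the aggregators and how to route updates between them is the optimization problem the paper addresses; the sketch does not attempt it.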
