LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning

Shixiong Qi, K. K. Ramakrishnan, Myungjin Lee

TL;DR

LIFL tackles the inefficiencies of large-scale federated learning with a lightweight, event-driven serverless platform that supports hierarchical aggregation over a high-performance intra-node data plane. Key innovations include a zero-copy shared memory data plane, in-place message queuing, an eBPF-based sidecar for lightweight processing and direct routing, and locality-aware planning plus aggregator reuse to maximize parallelism and minimize startup delays. The approach yields substantial improvements in latency and CPU efficiency compared with both serverful and existing serverless FL systems, demonstrated on ResNet workloads with hundreds of clients. The work enables scalable, cost-effective FL deployments in dynamic, heterogeneous environments and points to future work in asynchronous FL and broader platform integration.

Abstract

Federated Learning (FL) typically involves a large-scale, distributed system in which individual user devices/servers train models locally and a trusted central server then aggregates their model updates. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. These problems are exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual heavy-weight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism while minimizing the aggregation time and resource consumption. Our experimental results show that LIFL achieves significant improvement in resource efficiency and aggregation speed for supporting FL at scale, compared to existing serverful and serverless FL systems.
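
To make the idea of hierarchical aggregation concrete, the following is a minimal sketch and not LIFL's implementation. It assumes a FedAvg-style, sample-weighted average as the aggregation rule, which the abstract does not specify; the names `aggregate`, `hierarchical_aggregate`, and `fan_in` are hypothetical. It illustrates that leaf aggregators can each combine a subset of updates, possibly in parallel, and that a top aggregator combining their outputs obtains the same result as a single flat aggregation.

```python
import math
from typing import List, Tuple

# Assumption: an update is (model weights, number of local training samples).
Update = Tuple[List[float], int]

def aggregate(updates: List[Update]) -> Update:
    """Sample-weighted average of model updates (FedAvg-style, assumed)."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    # Carry the combined sample count so the result can be aggregated again.
    return avg, total

def hierarchical_aggregate(updates: List[Update], fan_in: int) -> Update:
    """Two-level aggregation: leaf aggregators, then one top aggregator."""
    # Each leaf aggregator handles up to `fan_in` updates (hypothetical knob).
    leaves = [aggregate(updates[i:i + fan_in])
              for i in range(0, len(updates), fan_in)]
    # The top aggregator combines the leaf results.
    return aggregate(leaves)

updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30),
           ([5.0, 6.0], 20), ([7.0, 8.0], 40)]
flat_weights, flat_count = aggregate(updates)
hier_weights, hier_count = hierarchical_aggregate(updates, fan_in=2)
same = all(math.isclose(a, b) for a, b in zip(flat_weights, hier_weights))
```

Because each intermediate result carries its combined sample count, the weighted average composes across levels, which is what lets the leaf-level work be spread over multiple aggregators without changing the final model.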


Paper Structure

This paper contains 30 sections, 1 equation, and 14 figures.

Figures (14)

  • Figure 1: Synchronous FL with different aggregation timing ("Eager" and "Lazy") [google-fl-mlsys19, jit-agg].
  • Figure 2: Generic architectures for FL systems: (a) Serverful FL systems [google-fl-mlsys19, papaya-mlsys2022]; (b) Serverless FL systems [lambda-fl, adafed, fedkeeper]. Note that for simplicity, we skip the hierarchy in diagram (b).
  • Figure 3: The overall architecture of LIFL.
  • Figure 4: Impact of data plane performance on hierarchical aggregation. Upper figure: no hierarchy (NH); lower figure: with hierarchy (WH). Top: top aggregator; LF: leaf aggregator. "Network" denotes the data transfer tasks of model updates; "Agg." denotes the aggregation tasks; "Eval." denotes the evaluation tasks.
  • Figure 5: Message queuing solutions.
  • ...and 9 more figures