FedHybrid: Breaking the Memory Wall of Federated Learning via Hybrid Tensor Management

Kahou Tam, Chunlin Tian, Li Li, Haikai Zhao, ChengZhong Xu

TL;DR

FedHybrid tackles the memory wall in on-device Federated Learning by coordinating memory-aware client selection, heterogeneity-aware graph optimization, and a local training engine that uses channel-wise mix compression and recomputation. It introduces a Memory Budget Predictor and a novel MPS-based optimization framework to balance memory reduction, model accuracy, and training efficiency under dynamic device contention. Empirical results on CV and NLP tasks show up to 39.1% accuracy gains and up to 15.5× wall-clock time reductions compared with baselines, across diverse memory budgets and devices. The work demonstrates practical viability for large-scale, mobile FL and provides a pathway to deploying privacy-preserving learning in resource-constrained environments with heterogeneous hardware and background workloads.

Abstract

Federated Learning (FL) has emerged as a new learning paradigm that enables multiple devices to collaboratively train a shared model while preserving data privacy. However, one fundamental and prevailing challenge that hinders the deployment of FL on mobile devices is the memory limitation. This paper proposes FedHybrid, a novel framework that effectively reduces the memory footprint during training while guaranteeing model accuracy and overall training progress. Specifically, FedHybrid first selects the participating devices for each training round by jointly evaluating their memory budget, computing capability, and data diversity. It then analyzes the computational graph and generates an execution plan for each selected client that meets the client's memory budget while minimizing the training delay, employing a hybrid of recomputation and compression techniques according to the characteristics of each tensor. During local training, FedHybrid carries out the execution plan with a well-designed activation compression technique to achieve memory reduction with minimum accuracy loss. We conduct extensive experiments to evaluate FedHybrid both in simulation and on off-the-shelf mobile devices. The experimental results demonstrate that FedHybrid achieves up to a 39.1% increase in model accuracy and a 15.5× reduction in wall-clock time under various memory budgets compared with the baselines.
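To make the idea of a per-tensor execution plan concrete, the following is a minimal sketch, not the authors' algorithm. It assumes each activation tensor has a known size, a recomputation latency, a compression latency, and a compression ratio, and it greedily applies whichever action saves the most memory per millisecond of added delay until the plan fits the budget. The names `Tensor` and `plan`, all cost fields, and the greedy rule itself are hypothetical and chosen only for illustration; the paper's actual optimization framework is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_mb: float         # memory held if the activation is kept as-is
    recompute_ms: float    # assumed extra latency to recompute it (> 0)
    compress_ms: float     # assumed extra latency to compress/decompress it (> 0)
    compress_ratio: float  # compressed size / original size, in (0, 1)


def plan(tensors, budget_mb):
    """Greedy sketch: keep everything, then apply the moves with the best
    MB-saved-per-ms ratio until the footprint fits the budget.

    Simplifications (assumptions): recomputing a tensor frees its whole
    size, each tensor receives at most one action, and an action is never
    revised once chosen, so the result is not guaranteed to be optimal.
    """
    actions = {t.name: "keep" for t in tensors}
    used = sum(t.size_mb for t in tensors)
    delay = 0.0

    # Build every candidate move as (efficiency, action, tensor, saved, cost).
    candidates = []
    for t in tensors:
        candidates.append(
            (t.size_mb / t.recompute_ms, "recompute", t, t.size_mb, t.recompute_ms)
        )
        saved = t.size_mb * (1.0 - t.compress_ratio)
        candidates.append((saved / t.compress_ms, "compress", t, saved, t.compress_ms))
    candidates.sort(key=lambda c: c[0], reverse=True)

    for _, action, t, saved, cost in candidates:
        if used <= budget_mb:
            break
        if actions[t.name] != "keep":
            continue  # this tensor already has an action assigned
        actions[t.name] = action
        used -= saved
        delay += cost

    if used > budget_mb:
        raise ValueError("greedy plan cannot meet the memory budget")
    return actions, used, delay


# Illustrative numbers only; they are not measurements from the paper.
example = [
    Tensor("conv1", 40.0, 8.0, 2.0, 0.25),
    Tensor("conv2", 30.0, 3.0, 2.0, 0.25),
    Tensor("fc", 10.0, 1.0, 1.0, 0.5),
]
actions, used_mb, added_ms = plan(example, budget_mb=45.0)
print(actions, used_mb, added_ms)
```

With these made-up costs, compression is the cheaper way to free memory for the two large convolution activations, so the small `fc` activation can stay resident. A tighter budget would push the plan toward recomputation as well, which is the kind of trade-off the hybrid approach is meant to exploit.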

Paper Structure

This paper contains 20 sections, 10 equations, 21 figures, 4 tables, 1 algorithm.

Figures (21)

  • Figure 1: Performance impact of memory constraints on Oort and memory utilization for different models. (a) Accuracy of MobileNetV2 trained on OpenImage with a batch size of 32. (b) Accuracy of BERT trained on AGNEWS with a batch size of 8. (c) Memory utilization on mobile devices during training, measured with MNN [alibaba2020mnn]; MNV2 denotes MobileNetV2.
  • Figure 2: Analysis of local training runtime on mobile devices. Local training is run with MNN and no background applications. (a) Runtimes of different devices with batch sizes of 16 and 32. (b) Breakdown of the training process's runtime status and average system memory usage during training with a batch size of 32. US denotes the Uninterruptible Sleep status, R the Running status, and RB the Runnable status. (c) Distribution of evicted pages during training with a batch size of 32 on the S22.
  • Figure 3: The performance of existing memory-saving techniques applied in FL with memory constraints.
  • Figure 4: I/O latency of writing and reading a 128 MB file on UFS 3.1, simulating swapping a 128 MB tensor in and out. (a) A higher CPU core frequency correlates with better I/O performance. (b) Running multiple apps degrades I/O performance due to contention in the UFS command queue.
  • Figure 5: Efficiency analysis of activation recomputation in FL with heterogeneous devices. All experiments are conducted on Melon [wang2022melon]. Comparison of different memory budgets versus training latency overhead for MobileNetV2 with a batch size of 32 and BERT with a batch size of 8, on the S22 and Note 10, respectively.
  • ...and 16 more figures