Bridging Memory Gaps: Scaling Federated Learning for Heterogeneous Clients
Yebo Wu, Jingguang Li, Chunlin Tian, Kahou Tam, Li Li, Chengzhong Xu
TL;DR
This work tackles the memory bottleneck in federated learning with ScaleFL, a scalable framework that partitions the global model into blocks and trains them sequentially. It couples a Curriculum Mentor, built on information bottleneck principles and Hilbert-Schmidt Independence Criterion (HSIC) estimates, with a Training Harmonizer that enables bidirectional information flow across blocks, mitigating the information loss and gradient isolation that block-wise training otherwise causes. Experiments across diverse datasets, heterogeneous devices, non-IID data, large-scale benchmarks, and Transformer-based models show substantial gains in accuracy, memory efficiency, and convergence speed. A convergence analysis supports the approach: under standard smoothness and bounded-gradient assumptions, ScaleFL converges to a stationary point, and the curriculum and co-adaptation components are key to its stability and performance.
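For readers unfamiliar with HSIC, the following is a minimal sketch of the standard biased empirical HSIC estimator with Gaussian kernels. It is not the paper's Curriculum Mentor: the function names, the kernel choice, and the bandwidth `sigma` are assumptions made for illustration only, and the summary above does not specify how the estimates enter the training objective.

```python
import math
from typing import List, Sequence


def gaussian_gram(samples: Sequence[Sequence[float]], sigma: float = 1.0) -> List[List[float]]:
    """Gram matrix of a Gaussian (RBF) kernel; sigma is an assumed bandwidth."""
    n = len(samples)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            gram[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return gram


def center(gram: List[List[float]]) -> List[List[float]]:
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]], sigma: float = 1.0) -> float:
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    k_centered = center(gaussian_gram(xs, sigma))
    l_gram = gaussian_gram(ys, sigma)
    trace = sum(k_centered[i][j] * l_gram[j][i] for i in range(n) for j in range(n))
    return trace / ((n - 1) ** 2)


# Toy data (hypothetical): ys that copy xs are dependent, constant ys are not.
xs = [[0.0], [1.0], [2.0], [3.0]]
constant = [[5.0]] * 4
dependent_score = hsic(xs, xs)
independent_score = hsic(xs, constant)
```

In this toy case the constant variable yields a score of essentially zero, while the identical variable yields a strictly positive score, which is the behavior a dependence measure should show.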
Abstract
Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, the high memory demand of model training severely limits the deployment of FL on resource-constrained clients. To address this, we propose ScaleFL, a scalable and inclusive FL framework that overcomes memory limitations through sequential block-wise training. The core idea of ScaleFL is to partition the global model into blocks and train them one at a time, so that a client never has to hold the training state of the whole model at once. To mitigate information loss during block-wise training, ScaleFL introduces a Curriculum Mentor that crafts a curriculum-aware training objective for each block to steer its learning. ScaleFL also incorporates a Training Harmonizer, which uses a parameter co-adaptation training scheme to coordinate block updates and break inter-block information isolation. Extensive experiments on both simulation and hardware testbeds demonstrate that ScaleFL improves model performance by up to 84.2%, reduces peak memory usage by up to 50.4%, and accelerates training by up to 1.9×.
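To make the memory argument concrete, here is a minimal sketch of why training one block at a time lowers peak memory. It uses a toy cost model that is an assumption, not the paper's accounting: each layer has a hypothetical training cost (activations, gradients, and optimizer state, in MB), only the block currently being trained incurs that cost, and the forward-only cost of frozen blocks is ignored. The function names and the per-layer numbers are illustrative.

```python
from typing import List


def partition_into_blocks(layer_costs: List[float], num_blocks: int) -> List[List[float]]:
    """Split consecutive layers into contiguous blocks of near-equal length."""
    n = len(layer_costs)
    if not 1 <= num_blocks <= n:
        raise ValueError("num_blocks must be between 1 and the number of layers")
    base, extra = divmod(n, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)  # the first `extra` blocks get one more layer
        blocks.append(layer_costs[start:start + size])
        start += size
    return blocks


def peak_training_memory(layer_costs: List[float], num_blocks: int) -> float:
    """Peak memory under the toy model: the most expensive single block.

    Blocks are trained one after another, so the peak is a maximum over
    blocks rather than a sum over all layers.
    """
    blocks = partition_into_blocks(layer_costs, num_blocks)
    return max(sum(block) for block in blocks)


# Hypothetical per-layer training costs in MB (early layers hold larger activations).
layer_costs = [40, 40, 30, 30, 20, 20, 10, 10]

end_to_end_peak = peak_training_memory(layer_costs, num_blocks=1)  # whole model at once
block_wise_peak = peak_training_memory(layer_costs, num_blocks=4)  # one block at a time
```

Under these assumed costs, end-to-end training peaks at 200 MB and four-block training at 80 MB. The numbers only illustrate the mechanism and say nothing about the reductions reported in the abstract.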
