FCDP: Fully Cached Data Parallel for Communication-Avoiding Large-Scale Training
Gyeongseo Park, Eungyeong Lee, Song-woo Sok, Myung-Hoon Cha, Kwangwon Koh, Baik-Song An, Hongyeon Kim, Ki-Dong Kang
TL;DR
FCDP addresses the inter-node communication bottleneck of fully sharded data parallel training (ZeRO-3) on commodity hardware by using host memory as a fast caching layer. It combines three components: FCDP-Sched, which caches forward-pass parameters for reuse in the backward pass; FCDP-Cache, which adaptively places parameters between GPU and host memory; and FCDP-Comm, which is PEFT-aware, caching frozen weights and exchanging only trainable adapters. Because the backward pass reads parameters from the cache instead of re-gathering them across nodes, inter-node all-gather traffic drops by 50%; for PEFT workloads the reduction reaches up to 99.9%. Throughput is up to 41.3% higher than ZeRO-3 at the same maximum batch size. On commodity clusters, FCDP thus matches or exceeds the throughput of GPU-based caching methods without sacrificing memory capacity, enabling efficient training of very large models and PEFT fine-tuning with minimal inter-node traffic.
Abstract
Training billion-parameter models requires distributing model states across GPUs using fully sharded data parallel (i.e., ZeRO-3). While ZeRO-3 succeeds on clusters with high-bandwidth NVLink and InfiniBand interconnects, researchers with commodity hardware face severe inter-node all-gather bottlenecks. Existing optimizations take two approaches: GPU memory caching (MiCS, ZeRO++) trades memory capacity for reduced communication, triggering out-of-memory failures on large models; host memory offloading (ZeRO-Offload, ZeRO-Infinity) extends capacity but degrades throughput due to PCIe overhead. We observe that on bandwidth-limited clusters, host memory can serve not as an overflow tier but as a fast caching layer that outperforms inter-node communication. Based on this insight, we propose FCDP, which eliminates redundant inter-node communication while preserving ZeRO-3's minimal GPU memory footprint. FCDP caches forward-pass parameters in host memory and reuses them during the backward pass via fast intra-node all-gather, reducing inter-node all-gather by 50%. For parameter-efficient fine-tuning (PEFT), FCDP selectively communicates only trainable parameters to maximize caching, reducing inter-node traffic by over 99%. In our commodity cluster setup, FCDP achieves up to 100x higher throughput than ZeRO-3 and 51x higher than ZeRO++, while maintaining ZeRO-3's maximum batch size.
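The 50% and "over 99%" figures follow from simple counting, which the sketch below illustrates. It is a back-of-the-envelope model, not the authors' implementation or measurement. It assumes that ZeRO-3 all-gathers every parameter across nodes once in the forward pass and once in the backward pass, that a host-memory cache removes the backward gather, and that under PEFT the frozen weights stay cached across steps so only trainable parameters are gathered. The names `Workload`, `zero3_allgather_bytes`, `cached_allgather_bytes`, and `peft_allgather_bytes`, the 16-bit parameter size, and the 7B/7M example sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    total_params: int          # all model parameters
    trainable_params: int      # parameters that receive updates
    bytes_per_param: int = 2   # assumption: 16-bit parameters


def zero3_allgather_bytes(w: Workload) -> int:
    # Baseline assumption: every parameter is gathered across nodes twice
    # per step (once for forward, once for backward).
    return 2 * w.total_params * w.bytes_per_param


def cached_allgather_bytes(w: Workload) -> int:
    # With forward-pass parameters cached in host memory, the backward pass
    # reuses them, so only the forward gather crosses nodes.
    return w.total_params * w.bytes_per_param


def peft_allgather_bytes(w: Workload) -> int:
    # Assumption: frozen weights never change, so they stay cached across
    # steps; only trainable parameters are gathered, once per step.
    return w.trainable_params * w.bytes_per_param


def reduction(baseline: int, optimized: int) -> float:
    # Fraction of baseline traffic that is avoided.
    return 1.0 - optimized / baseline


# Hypothetical sizes: a 7B-parameter model, fully trained vs. 0.1% adapters.
full = Workload(total_params=7_000_000_000, trainable_params=7_000_000_000)
peft = Workload(total_params=7_000_000_000, trainable_params=7_000_000)

full_saving = reduction(zero3_allgather_bytes(full), cached_allgather_bytes(full))
peft_saving = reduction(zero3_allgather_bytes(peft), peft_allgather_bytes(peft))

print(f"full training: {full_saving:.1%} less inter-node all-gather")
print(f"PEFT:          {peft_saving:.2%} less inter-node all-gather")
```

The model ignores topology-dependent factors such as the share of each all-gather that actually crosses node boundaries; these scale baseline and optimized traffic alike, so the ratios are unaffected under the stated assumptions. It also says nothing about throughput, which depends on how the saved communication compares with PCIe transfer and compute time.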
