FCDP: Fully Cached Data Parallel for Communication-Avoiding Large-Scale Training

Gyeongseo Park, Eungyeong Lee, Song-woo Sok, Myung-Hoon Cha, Kwangwon Koh, Baik-Song An, Hongyeon Kim, Ki-Dong Kang

TL;DR

FCDP addresses the inter-node communication bottleneck of fully sharded data parallel training (ZeRO-3) on commodity hardware by using host memory as a fast caching layer. It combines three components: FCDP-Sched, which caches forward-pass parameters for reuse in the backward pass; FCDP-Cache, which adaptively places parameters between GPU and host memory; and FCDP-Comm, a PEFT-aware communication scheme that caches frozen weights and exchanges only trainable adapters. Because the backward pass is served from the cache through intra-node all-gather, inter-node all-gather traffic drops by 50%; for PEFT workloads the reduction reaches up to 99.9%. FCDP delivers up to 41.3% higher throughput than ZeRO-3 while maintaining the same maximum batch size. On commodity clusters, it matches or exceeds the throughput of GPU-based caching methods without sacrificing memory capacity, enabling efficient training of very large models and PEFT fine-tuning with minimal inter-node traffic.

Abstract

Training billion-parameter models requires distributing model states across GPUs using fully sharded data parallel (i.e., ZeRO-3). While ZeRO-3 succeeds on clusters with high-bandwidth NVLink and InfiniBand interconnects, researchers with commodity hardware face severe inter-node all-gather bottlenecks. Existing optimizations take two approaches: GPU memory caching (MiCS, ZeRO++) trades memory capacity for reduced communication, triggering out-of-memory failures on large models; host memory offloading (ZeRO-Offload, ZeRO-Infinity) extends capacity but degrades throughput due to PCIe overhead. We observe that on bandwidth-limited clusters, host memory can serve not as an overflow tier but as a fast caching layer that outperforms inter-node communication. Based on this insight, we propose FCDP, which eliminates redundant inter-node communication while preserving ZeRO-3's minimal GPU memory footprint. FCDP caches forward-pass parameters in host memory and reuses them during the backward pass via fast intra-node all-gather, reducing inter-node all-gather by 50%. For parameter-efficient fine-tuning (PEFT), FCDP selectively communicates only trainable parameters to maximize caching, reducing inter-node traffic by over 99%. In our commodity cluster setup, FCDP achieves up to 100x higher throughput than ZeRO-3 and 51x higher than ZeRO++, while maintaining ZeRO-3's maximum batch size.
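
The 50% and over-99% figures in the abstract can be illustrated with a rough counting model. The sketch below is not the authors' implementation: the `Layer` class, the function names, and all parameter counts are hypothetical, and it assumes that ZeRO-3 all-gathers each layer across nodes once in the forward pass and once in the backward pass, that a host-memory cache removes the backward inter-node all-gather, and that in the PEFT case only trainable parameters cross the network once per step after frozen weights have been cached.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    """One sharded layer; sizes are parameter counts (hypothetical values)."""
    frozen_params: int
    trainable_params: int

    @property
    def total_params(self) -> int:
        return self.frozen_params + self.trainable_params


def zero3_inter_node_volume(layers: Sequence[Layer]) -> int:
    """Assumed baseline: each layer is all-gathered across nodes twice per
    step, once for the forward pass and once for the backward pass."""
    return sum(2 * layer.total_params for layer in layers)


def cached_inter_node_volume(layers: Sequence[Layer]) -> int:
    """Forward-pass parameters stay in a host-memory cache, so the backward
    all-gather is intra-node: one inter-node all-gather per layer remains."""
    return sum(layer.total_params for layer in layers)


def peft_cached_inter_node_volume(layers: Sequence[Layer]) -> int:
    """PEFT assumption: frozen weights are already cached (warm-up ignored),
    so only trainable parameters are gathered across nodes each step."""
    return sum(layer.trainable_params for layer in layers)


def reduction(baseline: int, value: int) -> float:
    """Fraction of baseline inter-node volume that is avoided."""
    return 1.0 - value / baseline


# Full training: every parameter is trainable.
full_layers = [Layer(frozen_params=0, trainable_params=1_000)] * 4
full_baseline = zero3_inter_node_volume(full_layers)
full_saving = reduction(full_baseline, cached_inter_node_volume(full_layers))

# PEFT: a small adapter on top of a large frozen layer (made-up sizes).
peft_layers = [Layer(frozen_params=1_000_000, trainable_params=1_000)] * 4
peft_baseline = zero3_inter_node_volume(peft_layers)
peft_saving = reduction(peft_baseline, peft_cached_inter_node_volume(peft_layers))

print(f"full training saving: {full_saving:.1%}")
print(f"PEFT saving: {peft_saving:.2%}")
```

Under these assumptions the full-training saving is exactly one half regardless of layer sizes, while the PEFT saving depends only on the ratio of trainable to total parameters. The model counts traffic volume only; it says nothing about throughput, which also depends on PCIe and intra-node bandwidth.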

Paper Structure

This paper contains 28 sections, 6 equations, 10 figures, 7 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Comparison of hypercluster and commodity cluster.
  • Figure 2: ZeRO-3 training throughput (GPT-10B, batch 8) across network configurations.
  • Figure 3: Overview of FCDP's three components: FCDP-Sched (parameter scheduling), FCDP-Cache (adaptive memory placement), and FCDP-Comm (PEFT-aware communication).
  • Figure 4: Per-layer execution schedule comparison. ZeRO-3 performs inter-node AG twice; ZeRO++ caches on GPU for intra-node backward AG; FCDP caches on CPU, achieving intra-node backward AG without GPU memory overhead.
  • Figure 5: Strong scaling performance across GPT models (10B--30B). ZeRO++ encounters out-of-memory failures on larger models (marked OOM).
  • ...and 5 more figures