PruneX: A Hierarchical Communication-Efficient System for Distributed CNN Training with Structured Pruning

Alireza Olama, Andreas Lundell, Izzat El Hajj, Johan Lilius, Jerker Björkqvist

TL;DR

PruneX tackles the inter-node bandwidth bottleneck in distributed CNN training by co-designing structured pruning with cluster topology through a Hierarchical Structured ADMM (H-SADMM) algorithm. It enforces node-level structured sparsity before inter-node synchronization, so buffers can be physically shrunk and dense collectives run on the compacted tensors; this is realized through a leader-follower architecture and a two-tier consensus. On 64 GPUs of the Puhti supercomputer, PruneX reduces inter-node communication volume by approximately 60% and reaches a 6.75x strong-scaling speedup, compared with 5.81x for the dense DDP baseline and 3.71x for Top-K gradient compression. The work also reports robust convergence and meaningful sparsity-accuracy trade-offs, and points toward scaling to larger models and deeper hierarchies.

Abstract

Inter-node communication bandwidth increasingly constrains distributed training at scale on multi-node GPU clusters. While compact models are the ultimate deployment target, conventional pruning-aware distributed training systems typically fail to reduce communication overhead because unstructured sparsity cannot be efficiently exploited by highly optimized dense collective primitives. We present PruneX, a distributed data-parallel training system that co-designs pruning algorithms with cluster hierarchy to reduce inter-node bandwidth usage. PruneX introduces the Hierarchical Structured ADMM (H-SADMM) algorithm, which enforces node-level structured sparsity before inter-node synchronization, enabling dynamic buffer compaction that eliminates both zero-valued transmissions and indexing overhead. The system adopts a leader-follower execution model with separated intra-node and inter-node process groups, performing dense collectives on compacted tensors over bandwidth-limited links while confining full synchronization to high-bandwidth intra-node interconnects. Evaluation on ResNet architectures across 64 GPUs demonstrates that PruneX reduces inter-node communication volume by approximately 60% and achieves 6.75x strong scaling speedup, outperforming the dense baseline (5.81x) and Top-K gradient compression (3.71x) on the Puhti supercomputer at CSC - IT Center for Science (Finland).
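The buffer compaction described in the abstract can be illustrated with a small standard-library sketch. It is not the PruneX implementation: the function names (`compact`, `allreduce_mean`, `restore`), the two hypothetical node leaders, and the 4x3 toy layer are assumptions made for illustration, and the AllReduce is simulated in-process as an element-wise mean. The point it shows is that when all leaders share the same structured mask, the pruned filters can be dropped before communication and restored by zero-filling afterwards, without sending any indices.

```python
from typing import List

Matrix = List[List[float]]  # one row per filter (output channel)


def compact(weights: Matrix, keep_mask: List[bool]) -> Matrix:
    """Drop pruned rows so the buffer is physically smaller and fully dense."""
    return [row[:] for row, keep in zip(weights, keep_mask) if keep]


def allreduce_mean(buffers: List[Matrix]) -> Matrix:
    """In-process stand-in for an inter-node AllReduce: element-wise mean."""
    n = len(buffers)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*buffers)]


def restore(compacted: Matrix, keep_mask: List[bool], width: int) -> Matrix:
    """Zero-fill the pruned rows to recover the original tensor shape."""
    rows = iter(compacted)
    return [next(rows) if keep else [0.0] * width for keep in keep_mask]


# Hypothetical setup: two node leaders, a layer with 4 filters of 3 weights.
# Filters 1 and 3 are pruned under a mask that every leader already holds.
keep_mask = [True, False, True, False]
leader_a = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [4.0, 5.0, 6.0], [0.0, 0.0, 0.0]]
leader_b = [[3.0, 2.0, 1.0], [0.0, 0.0, 0.0], [6.0, 5.0, 4.0], [0.0, 0.0, 0.0]]

payloads = [compact(w, keep_mask) for w in (leader_a, leader_b)]
reduced = allreduce_mean(payloads)            # runs on the small dense buffers
restored = restore(reduced, keep_mask, width=3)

dense_elems = len(leader_a) * 3               # elements a dense AllReduce would send
sent_elems = sum(len(row) for row in payloads[0])
print(restored)
print(f"payload per leader: {sent_elems}/{dense_elems} elements")
```

Because the mask is identical on every leader, the compacted buffers line up element for element, which is what lets an ordinary dense collective operate on them. With unstructured sparsity each worker would typically have to send index metadata alongside the values.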

Paper Structure

This paper contains 47 sections, 16 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: A schematic comparison between (a) the proposed hierarchical consensus structure utilizing node leaders to bridge local and global states, and (b) the standard flat consensus structure where all workers communicate directly with the global variable.
  • Figure 2: PruneX System Architecture. The system design maps the physical cluster topology to a four-layer logical software stack, comprising the Distributed Execution Substrate, Hierarchical State Manager, Pruning Engine, and Orchestration Control Loop.
  • Figure 3: Runtime Execution Timeline. The schedule illustrates the overlapping of high-bandwidth intra-node synchronization (AllReduce and Broadcast) with local computation, while latency-sensitive inter-node communication is isolated to the node leaders.
  • Figure 4: Physical Shrinkage and Recovery Pipeline. The mechanism transforms sparse tensors into compact dense buffers using global masks, executes the inter-node AllReduce on reduced payloads, and restores the full tensor shape via zero-filling for subsequent local training.
  • Figure 5: End-to-End Training Efficiency on ResNet-152. (a) PruneX (Green) reaches the 70% target accuracy faster than DDP (Blue) and Top-K (Pink), validating that algorithmic overhead is outweighed by communication gains. (b) Accuracy vs. Inter-node Communication Volume demonstrates that PruneX achieves high accuracy with a fraction of the data transfer required by dense training.
  • ...and 8 more figures
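
The hierarchical versus flat consensus structures contrasted in Figure 1 can be sketched with plain averaging. This is only a toy model under stated assumptions: it uses a simple mean in place of the H-SADMM updates (no ADMM dual variables or sparsity projection are modeled), and the function names and the 2-node, 2-GPU-per-node cluster are hypothetical. It shows that, when every node holds the same number of workers, averaging within each node first and then across node leaders yields the same global state as a flat average, while only the leaders take part in the inter-node step.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def hierarchical_consensus(cluster: List[List[Vector]]) -> Vector:
    """Tier 1: average inside each node. Tier 2: average across node leaders."""
    node_states = [mean(node_workers) for node_workers in cluster]
    return mean(node_states)


def flat_consensus(cluster: List[List[Vector]]) -> Vector:
    """Every worker contributes directly to the global average."""
    return mean([worker for node_workers in cluster for worker in node_workers])


# Hypothetical cluster: 2 nodes with 2 workers each, 2 parameters per worker.
cluster = [
    [[1.0, 2.0], [3.0, 4.0]],   # node 0
    [[5.0, 6.0], [7.0, 8.0]],   # node 1
]

hier = hierarchical_consensus(cluster)
flat = flat_consensus(cluster)
inter_node_participants_hier = len(cluster)                       # leaders only
inter_node_participants_flat = sum(len(node) for node in cluster)  # all workers

print(hier, flat)
print(inter_node_participants_hier, inter_node_participants_flat)
```

The equality between the two results depends on equal worker counts per node; with uneven nodes, the tier-2 average would need to be weighted by node size.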