NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan

TL;DR

NEST is a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. It provides a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure.

Abstract

The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices. This increases synchronization, inflates communication, and underutilizes compute, which limits scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show that NEST achieves up to 2.43× higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at: https://github.com/scai-tech/Nest
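
To make the idea of folding memory feasibility and network cost into a single dynamic program concrete, the toy sketch below may help. It is not NEST's algorithm: it only splits a linear chain of layers into contiguous pipeline stages, one stage per device, and minimizes the slowest stage. The function `place_chain`, its parameters, the two-level link cost (`intra_cost` within a node, `inter_cost` across nodes), and all numbers are assumptions made for illustration.

```python
import math
from functools import lru_cache


def place_chain(compute, memory, mem_cap, num_devices,
                node_size, intra_cost, inter_cost):
    """Toy placement DP (hypothetical, not NEST's formulation).

    Splits layers 0..n-1 into contiguous stages; stage k runs on device k.
    Returns (bottleneck_time, stage_end_indices), or (inf, None) when no
    split fits within the per-device memory cap.
    """
    n = len(compute)

    def link_cost(d):
        # Assumed two-level network: cheap inside a node, costly across nodes.
        same_node = (d // node_size) == ((d + 1) // node_size)
        return intra_cost if same_node else inter_cost

    @lru_cache(maxsize=None)
    def best(i, d):
        # Best bottleneck for placing layers i..n-1 on devices d, d+1, ...
        if i == n:
            return 0.0, ()
        if d == num_devices:
            return math.inf, None
        result = (math.inf, None)
        stage_time = stage_mem = 0.0
        for j in range(i, n):
            stage_time += compute[j]
            stage_mem += memory[j]
            if stage_mem > mem_cap:
                break  # memory feasibility is checked inside the search
            send = link_cost(d) if j + 1 < n else 0.0
            rest, cuts = best(j + 1, d + 1)
            if cuts is None:
                continue  # remaining layers cannot be placed
            cand = max(stage_time + send, rest)
            if cand < result[0]:
                result = (cand, (j + 1,) + cuts)
        return result

    return best(0, 0)


bottleneck, cuts = place_chain(
    compute=[4, 2, 2, 4], memory=[3, 3, 3, 3], mem_cap=6,
    num_devices=4, node_size=2, intra_cost=1, inter_cost=5)
print(bottleneck, cuts)
```

Because the memory check sits inside the recurrence, infeasible splits are pruned during the search rather than repaired afterwards, which is the contrast with post hoc sharding that the abstract draws. The real framework additionally covers tensor, data, and expert parallelism and general operator graphs, none of which this sketch attempts.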

Paper Structure (42 sections, 3 equations, 12 figures, 7 tables, 1 algorithm)

This paper contains 42 sections, 3 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: (Left) Nest compared to prior works across network modeling, optimality, scalability, and memory modeling axes. (Right) Comparison of placement with and without integrated memory modeling. Sharded operators are shown through patterns.
  • Figure 2: Impact of communication latency on training time across different parallelism strategies on an oversubscribed 64-GPU cluster. For each strategy, the left bar is without activation recomputation and the right bar is with it.
  • Figure 3: Nest search space and workflow. Graph-Global strategies partition entire layers, whereas Sub-Graph strategies partition computations within individual layers. Different colors denote device assignments, and dashed boxes show ZeRO and context parallelism.
  • Figure 4: The forward-pass dependency challenge and Nest's level-wise abstraction. The cost from unassigned Layer 22 to Layer 23 is unknown but abstracted as a discrete communication level (e.g., Level 0, 1, or 2).
  • Figure 5: Throughput comparison between Nest and baselines on a Fat-Tree network of TPUv4 accelerators. Throughput improvements are relative to the manual baseline’s smallest valid result. “X” indicates cases where the baseline failed to find a valid placement.
  • ...and 7 more figures