A Communication- and Memory-Aware Model for Load Balancing Tasks

Jonathan Lifflander, Philippe P. Pebay, Nicole L. Slattengren, Pierre L. Pebay, Robert A. Pfeiffer, Joseph D. Kotulski, Sean T. McGovern

TL;DR

The paper tackles load balancing in distributed-memory systems under strict memory constraints by introducing CCM, a reduced-order model that jointly accounts for computation, communication, and memory. It proposes CCM-LB, a fully distributed heuristic load balancer, and assesses how close its solutions are to optimal by comparing them against MILP formulations (COMCP and FWMP). The Gemma electromagnetics code serves as a practical testbed, with speedups of up to 2.3x for the imbalanced execution and results reported at each scale tested, aided by a neural time predictor trained on diverse configurations. The approach targets performance-portable load balancing for irregular, task-based workloads where memory is a limiting factor.

Abstract

While load balancing in distributed-memory computing has been well-studied, we present an innovative approach to this problem: a unified, reduced-order model that combines three key components to describe "work" in a distributed system: computation, communication, and memory. Our model enables an optimizer to explore complex tradeoffs in task placement, such as increased parallelism at the expense of data replication, which increases memory usage. We propose a fully distributed, heuristic-based load balancing optimization algorithm, and demonstrate that it quickly finds close-to-optimal solutions. We formalize the complex optimization problem as a mixed-integer linear program, and compare it to our strategy. Finally, we show that when applied to an electromagnetics code, our approach obtains up to 2.3x speedups for the imbalanced execution.
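The tradeoff mentioned in the abstract (more parallelism at the cost of data replication) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the paper's CCM formulation: the `Task` type, the task loads, the block sizes, and the per-rank memory limit are hypothetical, and communication is left out entirely to keep the example short. A shared block is assumed to be replicated on every rank that hosts at least one task using it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    load: float  # hypothetical compute time, arbitrary units
    block: str   # hypothetical shared memory block the task reads


def rank_loads(placement, tasks):
    """Sum of task loads on each rank that hosts at least one task."""
    loads = {}
    for t in tasks:
        rank = placement[t.name]
        loads[rank] = loads.get(rank, 0.0) + t.load
    return loads


def rank_memory(placement, tasks, block_bytes):
    """Bytes of shared blocks per rank; a block is counted once per rank using it."""
    blocks_on_rank = {}
    for t in tasks:
        blocks_on_rank.setdefault(placement[t.name], set()).add(t.block)
    return {r: sum(block_bytes[b] for b in bs) for r, bs in blocks_on_rank.items()}


def evaluate(placement, tasks, block_bytes, mem_limit):
    """Return (max rank load, total replicated bytes, fits within per-rank limit)."""
    loads = rank_loads(placement, tasks)
    mem = rank_memory(placement, tasks, block_bytes)
    feasible = all(m <= mem_limit for m in mem.values())
    return max(loads.values()), sum(mem.values()), feasible


# Hypothetical workload: two heavy tasks share block A, two light tasks share block B.
tasks = [
    Task("t0", 2.0, "A"),
    Task("t1", 2.0, "A"),
    Task("t2", 1.0, "B"),
    Task("t3", 1.0, "B"),
]
block_bytes = {"A": 100, "B": 100}

# Placement 1: co-locate tasks by block (no replication, but imbalanced).
colocated = {"t0": 0, "t1": 0, "t2": 1, "t3": 1}
# Placement 2: balance the loads (both blocks end up replicated on both ranks).
balanced = {"t0": 0, "t2": 0, "t1": 1, "t3": 1}

print(evaluate(colocated, tasks, block_bytes, mem_limit=150))
print(evaluate(balanced, tasks, block_bytes, mem_limit=150))
print(evaluate(balanced, tasks, block_bytes, mem_limit=250))
```

In this toy setting the balanced placement lowers the maximum rank load but doubles the total memory footprint, so whether it is admissible depends on the per-rank limit. An optimizer working under a memory constraint has to weigh exactly this kind of choice.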

Paper Structure

This paper contains 34 sections, 10 theorems, 37 equations, 5 figures, 1 table, 1 algorithm.

Key Result

theorem 3.1: Homing communications update formulæ

Figures (5)

  • Figure 1: The CCM-LB algorithm.
  • Figure 2: A Compute-Only Memory-Constrained Problem (COMCP) example for $I=2$, $K=3$, and $N=2$, with corresponding assignment sets and matrices.
  • Figure 3: An FWMP example for $I=2$, $K=3$, $M=4$, and $N=2$, with corresponding communication assignment sets and tensors.
  • Figure 4: Results comparing the Gurobi (MILP) solutions to CCM-LB.
  • Figure 5: Speedup of the assembly at each scale.

Theorems & Definitions (20)

  • theorem 3.1: Homing communications update formulæ
  • proof
  • theorem 5.1: Boolean shared block matrix relations
  • proof
  • theorem 5.2: Integer shared block matrix relations
  • proof
  • theorem 5.3: Boolean communication tensor relations
  • proof
  • theorem 5.4: Integer communication tensor relations
  • proof
  • ...and 10 more