A Granularity Characterization of Task Scheduling Effectiveness

Sana Taghipour Anvari, David Kaeli

TL;DR

This work introduces a granularity characterization framework that directly links scheduling overhead growth to task-graph dependency topology. It demonstrates that overhead models derived from dependency topology accurately predict strong-scaling limits and enable a practical runtime decision rule for selecting dynamic or static execution, without requiring exhaustive strong-scaling studies or extensive offline tuning.

Abstract

Task-based runtime systems provide flexible load balancing and portability for parallel scientific applications, but their strong scaling is highly sensitive to task granularity. As parallelism increases, scheduling overhead may transition from negligible to dominant, leading to rapid drops in performance for some algorithms, while remaining negligible for others. Although such effects are widely observed empirically, there is little understanding of how algorithmic structure determines whether dynamic scheduling is beneficial. In this work, we introduce a granularity characterization framework that directly links scheduling overhead growth to task-graph dependency topology. We show that dependency structure, rather than problem size alone, governs how overhead scales with parallelism. Based on this observation, we characterize execution behavior using a simple granularity measure that indicates when scheduling overhead can be amortized by parallel computation and when it dominates performance. Through experimental evaluation on representative parallel workloads with diverse dependency patterns, we demonstrate that the proposed characterization explains both gradual and abrupt strong-scaling breakdowns observed in practice. We further show that overhead models derived from dependency topology accurately predict strong-scaling limits and enable a practical runtime decision rule for selecting dynamic or static execution without requiring exhaustive strong-scaling studies or extensive offline tuning.

Paper Structure

This paper contains 26 sections, 15 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Illustrative dependency topologies at $P=4$ ranks. Edges represent logical inter-rank dependencies that constrain task execution order between computation phases. Global dependency (left) induces all-to-all dependencies, local dependency (center) induces nearest-neighbor dependencies, and independent workloads (right) induce no runtime dependencies.
  • Figure 2: Contrasting failure modes under strong scaling with dynamic scheduling. FFT (top row) exhibits an overhead cliff where scheduling overhead remains negligible across moderate scales but rises abruptly at high rank counts, consuming a majority of execution time at 256 ranks. Stencil (bottom row) exhibits gradual degradation where overhead grows steadily with scale and only approaches parity with useful work at the highest ranks.
  • Figure 3: Overhead fraction versus granularity number $G$ for all measured configurations. Each point represents one experimental configuration. Background shading indicates execution regimes: detrimental (red, $G < 1$), marginal (yellow, $1 \leq G < 10$), and beneficial (green, $G \geq 10$). While all configurations obey the same relationship, their trajectories under strong scaling differ: FFT (circles) traverses all three regimes, stencil and sweep (squares and diamonds) reach the marginal regime at high rank counts, and GEMM (triangles) remains entirely in the beneficial regime.
  • Figure 4: Calibration and validation of the direct scheduling-overhead model across three input sizes. Top panels fit the overhead model; bottom panels overlay it against kernel strong scaling. The red dot marks the predicted crossover $P^\star$. Shaded regions indicate the empirical interval between the last non-detrimental and first detrimental ($G < 1$) measured configuration. Where observed, $P^\star$ falls within or adjacent to this interval.
  • Figure 5: Calibration and validation of the scheduling-overhead model for four additional workloads. Top row: SpMV uses a five-point Laplacian on a 2D grid with 1D row decomposition and nearest-neighbor halo exchange; Conv2D uses 1D row decomposition with a $3\times3$ kernel, $C_{\mathrm{in}}=3$ input channels, and $C_{\mathrm{out}}=16$ output channels. Bottom row: PageRank and N-Body. All workloads are configured for comparable total computational work (${\sim}453$M operations per iteration): SpMV and Conv2D operate on a $21{,}285^2$ grid, N-Body computes $O(N^2)$ pairwise interactions among $N=21{,}285$ particles, and PageRank processes $N=30.2$M vertices with average degree 15 (${\sim}453$M edges).
  • ...and 1 more figures
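
The regime thresholds in the Figure 3 caption and the empirical interval described in the Figure 4 caption can be illustrated with a short Python sketch. This is not the authors' implementation: the function names `classify_regime` and `crossover_interval` are hypothetical, the granularity number $G$ is treated as an already-computed input (its definition is not reproduced in this excerpt), and the sample measurements are made-up values chosen only to exercise the code.

```python
from typing import Optional, Sequence, Tuple

# Regime thresholds as stated in the Figure 3 caption.
DETRIMENTAL_BELOW = 1.0   # G < 1         -> detrimental
BENEFICIAL_FROM = 10.0    # G >= 10       -> beneficial; 1 <= G < 10 -> marginal


def classify_regime(g: float) -> str:
    """Map a granularity number G to the regime named in Figure 3."""
    if g < DETRIMENTAL_BELOW:
        return "detrimental"
    if g < BENEFICIAL_FROM:
        return "marginal"
    return "beneficial"


def crossover_interval(
    measurements: Sequence[Tuple[int, float]],
) -> Optional[Tuple[int, int]]:
    """Return (last non-detrimental P, first detrimental P), or None.

    `measurements` holds (rank count P, granularity number G) pairs.
    The result mirrors the shaded interval described for Figure 4; None
    means no measured configuration pair crosses G = 1.
    """
    ordered = sorted(measurements)  # sort by rank count P
    for (p_prev, g_prev), (p_next, g_next) in zip(ordered, ordered[1:]):
        if g_prev >= DETRIMENTAL_BELOW and g_next < DETRIMENTAL_BELOW:
            return (p_prev, p_next)
    return None


if __name__ == "__main__":
    # Illustrative values only; these are NOT results from the paper.
    sample = [(64, 4.0), (128, 1.5), (256, 0.6)]
    for p, g in sample:
        print(p, classify_regime(g))
    print(crossover_interval(sample))
```

With measurements like these, a predicted crossover $P^\star$ could be compared against the returned interval, which is the kind of check the Figure 4 caption describes.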