Practical Performance Guarantees for Pipelined DNN Inference

Aaron Archer, Matthew Fahrbach, Kuikui Liu, Prakash Prabhu

TL;DR

The paper tackles max-throughput partitioning for pipelined DNN inference, where the bottleneck stage time $t$ caps throughput at $1/t$ and the end-to-end latency is $kt$. It formalizes the max-throughput partitioning problem (MTPP) as an NP-hard problem, develops exact and relaxed MIP lower bounds, and introduces SliceGraph, a practical partitioning algorithm that combines dynamic programming with a biased random-key genetic algorithm. Through extensive offline experiments on 369 production graphs and synthetic REGAL graphs across $k \in \{2, 4, 8, 16, 32, 64\}$, the authors demonstrate that the strongest per-instance lower bounds (including the exact and bottleneck-guess formulations) certify that the partitions found are near-optimal: the average lower bound comes close to the best partition found (e.g., a ratio of about 0.9452 for $k = 16$), a substantial improvement over simple combinatorial bounds. These certificates enable principled stopping criteria for compile-time partitioning, reducing wasted engineering effort while ensuring near-optimal throughput in practical DNN inference workloads.
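The relationship between stage times, throughput, and latency can be illustrated with a minimal sketch. The function name `pipeline_metrics` and the example stage times are hypothetical and not taken from the paper; the sketch only assumes that each stage's total time (including communication) is already known.

```python
def pipeline_metrics(stage_times):
    """Steady-state metrics for a k-stage pipeline.

    stage_times: per-stage running times (hypothetical values), each
    assumed to already include communication.
    """
    k = len(stage_times)
    t = max(stage_times)          # bottleneck stage time
    return {
        "bottleneck": t,
        "throughput": 1.0 / t,    # batches completed per unit time
        "latency": k * t,         # end-to-end time for one batch
    }


# Three stages; the second stage is the bottleneck.
metrics = pipeline_metrics([2.0, 5.0, 3.0])
```

Shrinking a non-bottleneck stage leaves all three metrics unchanged here, which is why the objective is to minimize the maximum stage time rather than the total.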

Abstract

We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner's dilemma of deciding when a solution is good enough. To this end, we design novel mixed-integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for $k \in \{2, 4, 8, 16, 32, 64\}$, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across a production testbed with $k = 16$ pipeline stages, our MIP formulations raise the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds close the optimality gap by a factor of 9.855x.
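The gap-closing factor can be sanity-checked with a short calculation. This is a sketch under the assumption that the optimality gap is defined as one minus the lower-bound ratio; the abstract does not spell out the definition, and the helper name `gap_reduction` is hypothetical.

```python
def gap_reduction(old_ratio, new_ratio):
    """Factor by which the gap (1 - ratio) shrinks; assumed definition."""
    return (1.0 - old_ratio) / (1.0 - new_ratio)


# Ratios quoted in the abstract for k = 16.
factor = gap_reduction(0.4598, 0.9452)
```

With the rounded ratios this gives roughly 9.86, in line with the reported 9.855x; the small difference is presumably due to rounding of the published ratios.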

Paper Structure (40 sections, 14 theorems, 8 equations, 5 figures, 3 tables, 5 algorithms)

Key Result

Theorem 3.1

For $k=2$, MTPP is NP-hard. Furthermore, there does not exist a fully polynomial-time approximation scheme for MTPP, unless $\textnormal{P} = \textnormal{NP}$.

Figures (5)

  • Figure 1: Inference pipeline from startup to steady state with $k=3$ stages. Each inference batch is represented with the same color as it advances through the pipeline. Values $\texttt{i}_{b \ell}$, $\texttt{w}_{b \ell}$, $\texttt{o}_{b \ell}$ are the times needed for stage $\ell$ to get its input for batch $b$, process it, and flush its output. Stage $2$ is the bottleneck, i.e., $t = \texttt{i}_{*2} + \texttt{w}_{*2} + \texttt{o}_{*2}$, and limits system throughput. Empty space (white) denotes idle time.
  • Figure 2: Partitioning computation graphs: (left) tensor cut property where $\textnormal{io}(S, T) = 2$ because $v$ and $w$ consume the same tensor; (middle) invalid partition because blocks $P_2$ and $P_3$ form a cycle in the quotient graph; (right) valid partition with block costs for $k=3$.
  • Figure 3: Exact MIP for solving MTPP, where variables $x_{vb} \in \{0,1\}$ indicate whether node $v \in V$ is assigned to block $b \in [k]$.
  • Figure 4: Running times of SliceGraph and different MIP lower bound computations across the production models. Each point denotes a run for one graph, color-coded to denote SliceGraph partitioning vs. bottleneck, bottleneck-guess, and exact lower bounds. The bottleneck-guess times are summed across all $k$ MIP instances involved. Each plot is for a different value of $k$. In order to facilitate visual comparisons across the plots, all three employ the same $y$-axis. Some of the data tops out at 3600 seconds since that is where we set the MIP time limit.
  • Figure 5: Running times of SliceGraph and different MIP lower bound computations across the REGAL models. Each point denotes a run for one graph, color-coded to denote SliceGraph partitioning vs. bottleneck, bottleneck-guess, and exact lower bounds. The bottleneck-guess times are summed across all $k$ MIP instances involved. Each plot is for a different value of $k$. In order to facilitate visual comparisons across the plots, all three employ the same $y$-axis. Some of the data tops out at 3600 seconds since that is where we set the MIP time limit.
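The two partitioning ideas in the Figure 2 caption, the tensor cut property and the acyclic quotient graph, can be sketched in a few lines. This is an illustration only, not the paper's implementation. It assumes each node emits a single output tensor whose size is given by a hypothetical `out_size` map, so a tensor crossing from $S$ to $T$ is charged once no matter how many nodes in $T$ consume it. The function names and toy graphs are also hypothetical.

```python
from collections import defaultdict, deque


def io_cost(edges, out_size, S, T):
    """Tensor-cut cost from S to T: each producer in S that has at least
    one consumer in T is charged once (assumed one output tensor per node)."""
    producers = {u for (u, v) in edges if u in S and v in T}
    return sum(out_size[u] for u in producers)


def is_valid_partition(edges, block_of, k):
    """True if the quotient graph over blocks 0..k-1 is acyclic
    (checked with Kahn's topological-sort algorithm)."""
    succ = defaultdict(set)
    for u, v in edges:
        a, b = block_of[u], block_of[v]
        if a != b:
            succ[a].add(b)
    indeg = {b: 0 for b in range(k)}
    for a in list(succ):
        for b in succ[a]:
            indeg[b] += 1
    queue = deque(b for b in range(k) if indeg[b] == 0)
    visited = 0
    while queue:
        a = queue.popleft()
        visited += 1
        for b in succ[a]:
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    return visited == k  # every block sorted means no cycle


# Three edges cross the cut, but only two distinct tensors do.
cut_edges = [("a", "v"), ("u", "v"), ("u", "w")]
sizes = {"a": 1, "u": 1, "v": 1, "w": 1}
cost = io_cost(cut_edges, sizes, S={"a", "u"}, T={"v", "w"})

# A chain p -> q -> r: placing p and r together with q elsewhere
# creates a cycle between the two blocks.
chain = [("p", "q"), ("q", "r")]
bad = is_valid_partition(chain, {"p": 0, "q": 1, "r": 0}, k=2)
good = is_valid_partition(chain, {"p": 0, "q": 0, "r": 1}, k=2)
```

Counting producers rather than edges is what distinguishes the tensor cut from an ordinary edge cut. In the toy graph above an edge-cut cost would be 3, while the tensor-cut cost is 2.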

Theorems & Definitions (25)

  • Definition 2.1
  • Remark 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Lemma 3.3: Simple lower bound
  • Corollary 3.3
  • Lemma 4.0
  • Lemma 4.0
  • Theorem 4.1
  • Theorem 1.1
  • ...and 15 more