PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping

Zhixin Zhao, Yitao Hu, Simin Chen, Mingfang Ji, Wei Yang, Yuhao Zhang, Laiping Zhao, Wenxin Li, Xiulong Liu, Wenyu Qu, Hao Wang

TL;DR

PARD tackles the problem of maintaining high goodput in latency-sensitive multi-model DNN inference pipelines by shifting from reactive to proactive request dropping. It integrates bi-directional runtime information to estimate end-to-end latency and employs a DEPQ-based adaptive priority mechanism to decide when and which requests to drop, balancing remaining latency budgets with workload dynamics. The approach decomposes end-to-end latency into preceding, current, and subsequent components and uses a three-round heuristic to set downstream batch-wait constraints, enabling timely drops earlier in the pipeline. Empirical evaluation on a 64-GPU cluster across real-world pipelines demonstrates substantial gains in goodput (16%–176% over state-of-the-art) with large reductions in drop rate (1.6×–17×) and wasted computation (1.5×–62×), and robust performance under varied SLOs and workloads.
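The latency decomposition described above can be illustrated with a minimal sketch. It assumes the three components are available as millisecond estimates and that a request is dropped when their sum exceeds the SLO; the names `LatencyEstimate`, `should_drop`, and `slo_ms` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class LatencyEstimate:
    """Hypothetical container for the three latency components (milliseconds)."""
    preceding_ms: float   # time already spent before reaching the current module
    current_ms: float     # expected batch wait plus execution at the current module
    subsequent_ms: float  # estimated time needed by all downstream modules

    @property
    def end_to_end_ms(self) -> float:
        # End-to-end latency is the sum of the three components.
        return self.preceding_ms + self.current_ms + self.subsequent_ms


def should_drop(estimate: LatencyEstimate, slo_ms: float) -> bool:
    """Assumed rule: drop early if the estimated end-to-end latency exceeds the SLO."""
    return estimate.end_to_end_ms > slo_ms


if __name__ == "__main__":
    late = LatencyEstimate(preceding_ms=40.0, current_ms=30.0, subsequent_ms=50.0)
    on_time = LatencyEstimate(preceding_ms=40.0, current_ms=30.0, subsequent_ms=20.0)
    print(should_drop(late, slo_ms=100.0))     # estimated 120 ms against a 100 ms SLO
    print(should_drop(on_time, slo_ms=100.0))  # estimated 90 ms against a 100 ms SLO
```

The point of the sketch is only that the decision is made before the request enters a batch, so a request predicted to miss its deadline consumes no further GPU time. How PARD actually estimates the subsequent component is not reproduced here.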

Abstract

Modern deep neural network (DNN) applications integrate multiple DNN models into inference pipelines with stringent latency requirements for customized tasks. To mitigate extensive request timeouts caused by accumulation, systems for inference pipelines commonly drop a subset of requests so the remaining ones can satisfy latency constraints. Since it is commonly believed that request dropping adversely affects goodput, existing systems only drop requests when they have to, which we call reactive dropping. However, this reactive policy cannot maintain high goodput, as it neither makes timely dropping decisions nor identifies the proper set of requests to drop, so requests are dropped too late or the wrong set of requests is dropped. We propose that the inference system should proactively drop certain requests in advance to enhance the goodput across the entire workload. To achieve this, we design PARD, an inference system that enhances goodput with timely and precise dropping decisions. It integrates a proactive dropping method that decides when to drop requests using runtime information of the inference pipeline, and an adaptive request priority mechanism that selects which specific requests to drop based on remaining latency budgets and workload intensity. Evaluation on a cluster of 64 GPUs over real-world workloads shows that PARD achieves $16\%$–$176\%$ higher goodput than the state of the art while reducing the drop rate and wasted computation resources by $1.6\times$–$17\times$ and $1.5\times$–$62\times$, respectively.
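As a rough illustration of the adaptive priority idea, the sketch below keeps requests in a double-ended priority queue (DEPQ) keyed on remaining latency budget, so that either the tightest or the loosest request can be taken in one step. The class `BudgetDEPQ`, the function `next_request`, and the rule for switching ends under overload are assumptions made for this example; the abstract does not specify PARD's actual policy.

```python
import bisect
from typing import List, Tuple


class BudgetDEPQ:
    """Minimal DEPQ keyed on remaining latency budget (ms), backed by a sorted list."""

    def __init__(self) -> None:
        self._items: List[Tuple[float, str]] = []  # kept in ascending budget order

    def push(self, budget_ms: float, request_id: str) -> None:
        bisect.insort(self._items, (budget_ms, request_id))

    def pop_min(self) -> Tuple[float, str]:
        # Request with the smallest remaining budget (closest to its deadline).
        return self._items.pop(0)

    def pop_max(self) -> Tuple[float, str]:
        # Request with the largest remaining budget (furthest from its deadline).
        return self._items.pop()

    def __len__(self) -> int:
        return len(self._items)


def next_request(queue: BudgetDEPQ, overloaded: bool) -> Tuple[float, str]:
    """Hypothetical switching rule, not taken from the paper:
    serve the most urgent request under light load, and the request
    most likely to finish in time under heavy load."""
    return queue.pop_max() if overloaded else queue.pop_min()


if __name__ == "__main__":
    q = BudgetDEPQ()
    for budget, rid in [(30.0, "a"), (80.0, "b"), (50.0, "c")]:
        q.push(budget, rid)
    print(next_request(q, overloaded=False))  # takes from the low-budget end
    print(next_request(q, overloaded=True))   # takes from the high-budget end
```

A sorted list keeps the example short; a production DEPQ would more likely use a min-max heap or an interval heap to avoid linear-time insertion and removal.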

Paper Structure (19 sections, 4 equations, 15 figures, 2 tables)

This paper contains 19 sections, 4 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Request latency composition under various dropping policies in a three-model inference pipeline.
  • Figure 2: (a) and (b) The minimum goodput and corresponding drop rate across various time window sizes of existing inference systems, naive baseline, and PARD under lv-tweet workload. (c) The percentage of dropped requests at each module under different workloads from the evaluation methodology section with the reactive dropping policy. (d) Transient drop rate of the reactive dropping policy.
  • Figure 3: (a) The reactive dropping policy makes decisions based on request arrival order, leading to the drop-wrong-set issue. (b) Batched requests have different batch wait times $W$, ranging from $0$ to the batch execution duration $d$.
  • Figure 4: PARD overview
  • Figure 5: Lifecycle of a request $R$ sent at $t_s$ by the client in an $N$-module pipeline, where $t_r$, $t_b$, $t_e$ represent the moments when the request is received by module $M_k$, put into a batch, and when the batch execution starts, respectively. At $t_b$, PARD could get all bi-directional runtime information for each request and make a timely dropping decision before it enters a batch.
  • ...and 10 more figures