Duration Aware Scheduling for ASR Serving Under Workload Drift

Darshan Makwana, Yash Jogi, Harsh Kotta, Aayush Kubba

Abstract

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet, widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and leads to head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared to the FCFS baseline, SJF reduces median E2E latency by up to $73\%$ at high load, but increases $90$th-percentile tail latency by up to $97\%$ due to starvation of long requests. HRRN addresses this trade-off: it reduces median E2E latency by up to $28\%$ while bounding tail-latency degradation to at most $24\%$. These gains persist under workload drift, with no throughput penalty and $<0.1$\,ms scheduling overhead per request.
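To make the difference between the two policies concrete, the following is a minimal sketch of how SJF and HRRN might choose the next request when audio duration stands in for processing time. It is not the vLLM scheduler or its API: the `Request` class, the helper names, and the two-request queue are hypothetical and exist only for illustration. The HRRN score used here is the classical response ratio, (waiting time + estimated service time) / estimated service time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical queue entry; not a vLLM type."""
    name: str
    arrival_s: float  # time the request entered the queue
    audio_s: float    # audio duration, used as the service-time proxy (> 0)


def hrrn_ratio(req: Request, now: float) -> float:
    # Classical HRRN response ratio: (wait + service) / service.
    # The ratio grows as a request waits, so long requests eventually win.
    wait = max(0.0, now - req.arrival_s)
    return (wait + req.audio_s) / req.audio_s


def pick_sjf(queue: list[Request], now: float) -> Request:
    # SJF: shortest audio first; waiting time is ignored entirely.
    return min(queue, key=lambda r: r.audio_s)


def pick_hrrn(queue: list[Request], now: float) -> Request:
    # HRRN: highest response ratio first.
    return max(queue, key=lambda r: hrrn_ratio(r, now))


# Illustrative queue: a long request that has waited 60 s and a
# short request that arrived 2 s ago.
queue = [
    Request("long_old", arrival_s=0.0, audio_s=30.0),
    Request("short_new", arrival_s=58.0, audio_s=5.0),
]
now = 60.0

print(pick_sjf(queue, now).name)   # SJF keeps preferring the short request
print(pick_hrrn(queue, now).name)  # HRRN promotes the long, aged request
```

In this toy queue the long request has a response ratio of (60 + 30) / 30 = 3.0 against (2 + 5) / 5 = 1.4 for the short one, so HRRN serves it first while SJF would keep deferring it. This aging effect is what limits starvation of long requests.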

Paper Structure (35 sections, 2 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Toy example illustrating head-of-line blocking under FCFS and the benefit of duration-aware scheduling. Three requests arrive in order $R_1,R_2,R_3$ with audio durations $8$ s, $4$ s, and $2$ s. We assume a constant encoder cost of $1$ s per request and $5$ output tokens per second of audio, yielding $40$, $20$, and $10$ output tokens, respectively. Under FCFS, serving $R_1$ first delays the two shorter requests and yields an average end-to-end latency of $(5{+}8{+}10)/3\approx 7.67$ s. Reordering by shortest job first (SJF) reduces the average to $(2{+}5{+}10)/3\approx 5.67$ s (a $1.4\times$ improvement). All numbers are illustrative and not intended to match measured system timings.
  • Figure 2: Scatter plots showing the relationship between audio duration and ASR output token count. (a) On the LibriSpeech English test set, token count increases linearly with audio duration, indicating a strong correlation. (b) On the FLEURS test sets for Spanish, Hindi, and Arabic, the linear duration–token relationship is maintained, demonstrating that this correlation generalizes across typologically diverse languages.
  • Figure 3: LibriSpeech test-clean: End-to-end latency scaling ($1$--$25$ req/s). (a) SJF reduces $P50$ E2EL by up to $73\%$ at $25$ req/s, (b) while $P90$ E2EL reveals the starvation trade-off at high load.
  • Figure 4: LibriSpeech test-clean: Percentage change in E2EL versus FCFS ($1$--$25$ req/s). Negative values indicate improvement. SJF's $P50$ gains deepen monotonically with load, while $P90$ degradation emerges beyond $15$ req/s.
  • Figure 5: Synthetic split: End-to-end latency scaling ($1$--$30$ req/s). (a) SJF's $P50$ E2EL advantage ($-67\%$ at $25$ req/s) persists under a uniform duration distribution, (b) while $P90$ E2EL degradation is moderated compared to LibriSpeech's right-skewed workload.
  • ...and 8 more figures
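
The arithmetic in the Figure 1 toy example can be checked with a short sketch. Two assumptions are made here that the caption does not state: the caption's token rate is read as tokens per second of audio (which matches the 40, 20, and 10 token counts), and the decode throughput is taken to be 10 tokens/s, a value chosen only because it reproduces the caption's completion times of 5, 8, and 10 s. All three requests are assumed to be queued at time 0 and served one at a time.

```python
ENCODER_S = 1.0           # constant encoder cost per request (from the caption)
TOKENS_PER_AUDIO_S = 5    # output tokens per second of audio (8 s -> 40 tokens)
DECODE_TOKENS_PER_S = 10  # ASSUMPTION: not in the caption; reproduces 5, 8, 10 s


def service_time(audio_s: float) -> float:
    # Encoder cost plus time to decode all output tokens.
    tokens = audio_s * TOKENS_PER_AUDIO_S
    return ENCODER_S + tokens / DECODE_TOKENS_PER_S


def completion_times(audio_durations: list[float]) -> list[float]:
    # Serve requests sequentially in the given order, all queued at t = 0.
    clock = 0.0
    finished = []
    for audio_s in audio_durations:
        clock += service_time(audio_s)
        finished.append(clock)
    return finished


arrival_order = [8, 4, 2]                     # R1, R2, R3 audio durations in seconds
fcfs = completion_times(arrival_order)        # serve in arrival order
sjf = completion_times(sorted(arrival_order)) # serve shortest audio first

mean_fcfs = sum(fcfs) / len(fcfs)
mean_sjf = sum(sjf) / len(sjf)

print(fcfs, round(mean_fcfs, 2))
print(sjf, round(mean_sjf, 2))
print(round(mean_fcfs / mean_sjf, 2))
```

Under these assumptions the per-request service times are 5, 3, and 2 s, so total work is 10 s under either ordering. Only the order of completion changes, which is why the average latency improves without any change in throughput.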