SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling

Cunchi Lv, Xiao Shi, Dong Liang, Wenting Tan, Xiaofang Zhao

TL;DR

This work addresses low GPU utilization in distributed DL training by proposing SpecInF, which speculatively fills idle GPU bubbles with collocated online and offline inference tasks. It introduces a two-layer system with a Bubble Monitor and a CUDA Kernel Scheduler that detect idling and manage token-based inference execution, plus a Kernel Barrier that coordinates GPU access so that training throughput is safeguarded. The approach yields up to $14\times$ higher offline inference throughput than TGS and a 67% reduction in p95 latency for online inference relative to MPS, across DP/MP/PP configurations, with minimal overhead. Practically, SpecInF enables more efficient use of GPU resources in large-scale DL workloads (e.g., LLMs) by putting otherwise idle compute and memory to work without sacrificing training performance.

Abstract

Deep Learning (DL), especially with Large Language Models (LLMs), brings benefits to various areas. However, DL training systems usually leave substantial GPU resources idle due to many factors, such as resource allocation and collective communication. To improve GPU utilization, we present SpecInF, which adopts a Speculative Inference Filling method to exploit idle GPU resources. It collocates each primary training instance with additional inference instances on the same GPU, detects training bubbles, and adaptively fills them with online or offline inference workloads. Our results show that SpecInF can effectively enhance GPU utilization under mainstream parallel training modes, delivering up to $14\times$ higher offline inference throughput than TGS and a 67% reduction in online inference p95 latency compared with MPS, while guaranteeing collocated training throughput.
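To make the idea of detecting training bubbles and filling them concrete, here is a minimal sketch. It is not SpecInF's Bubble Monitor or scheduler, whose internals are not described in this summary. The function names `find_bubbles` and `plan_fill`, the 10% idle threshold, the minimum bubble length, and the "online first, otherwise offline" policy are all assumptions made for illustration. The input is a plain list of sampled GPU utilization percentages, such as one might collect from a monitoring API.

```python
from typing import List, Tuple

Bubble = Tuple[int, int]  # half-open sample index range [start, end)


def find_bubbles(util: List[float],
                 idle_threshold: float = 10.0,
                 min_samples: int = 2) -> List[Bubble]:
    """Return index ranges where utilization stays below idle_threshold.

    Hypothetical helper: threshold and minimum length are assumed values,
    not parameters taken from the paper.
    """
    bubbles: List[Bubble] = []
    start = None
    for i, u in enumerate(util):
        if u < idle_threshold:
            if start is None:
                start = i  # a candidate bubble begins here
        else:
            # The idle run ended; keep it only if it is long enough.
            if start is not None and i - start >= min_samples:
                bubbles.append((start, i))
            start = None
    # Handle a bubble that runs to the end of the trace.
    if start is not None and len(util) - start >= min_samples:
        bubbles.append((start, len(util)))
    return bubbles


def plan_fill(bubbles: List[Bubble], online_pending: int) -> List[Tuple[str, Bubble]]:
    """Assign each bubble to online work while requests are queued, else offline.

    Assumed policy for illustration only.
    """
    plan = []
    for bubble in bubbles:
        if online_pending > 0:
            plan.append(("online", bubble))
            online_pending -= 1
        else:
            plan.append(("offline", bubble))
    return plan


if __name__ == "__main__":
    trace = [95, 90, 5, 3, 2, 92, 4, 96, 1, 0]
    found = find_bubbles(trace)
    print(found)
    print(plan_fill(found, online_pending=1))
```

The single-sample dip at index 6 is ignored because it is shorter than `min_samples`; a real system would have to weigh such short gaps against the cost of launching inference kernels into them.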

Paper Structure

This paper contains 19 sections, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: The GPU compute utilization timeline of two modes, as monitored by the NVML APIs. (a) training the RoBERTa-large model in DP mode via PyTorch DDP; (b) fine-tuning LLaMA2-7B in MP mode via DeepSpeed. Both cases involve 4 GPU workers.
  • Figure 2: GPU occupying characteristics of distributed training and inference.
  • Figure 3: The system architecture of SpecInF.
  • Figure 4: DP performance comparison: (a) the solid bar represents normalized training throughput and the light bar with dashed lines represents normalized offline inference throughput. (b) bars in the upper subfigure indicate normalized training throughputs, and the lower subfigure shows p95 latency of online inference. TGS is excluded due to excessive tail latencies.
  • Figure 5: MP performance comparison.
  • ...and 3 more figures