SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling
Cunchi Lv, Xiao Shi, Dong Liang, Wenting Tan, Xiaofang Zhao
TL;DR
This work addresses low GPU utilization in distributed DL training by proposing SpecInF, which speculatively fills idle GPU bubbles with collocated online and offline inference tasks. It introduces a two-layer system: a Bubble Monitor detects idle periods, and a CUDA Kernel Scheduler manages token-based inference execution, using a Kernel Barrier to coordinate GPU access so that training throughput is protected. The approach delivers up to 14$\times$ the offline inference throughput of TGS and a 67\% reduction in online inference p95 latency relative to MPS, across DP/MP/PP configurations, with minimal overhead. In practice, SpecInF makes better use of GPU resources in large-scale DL workloads (e.g., LLMs) by reclaiming idle compute and memory without sacrificing training performance.
Abstract
Deep Learning (DL), especially with Large Language Models (LLMs), brings benefits to various areas. However, DL training systems often leave substantial GPU resources idle due to factors such as resource allocation and collective communication. To improve GPU utilization, we present SpecInF, which adopts a Speculative Inference Filling method to exploit idle GPU resources. It collocates each primary training instance with additional inference instances on the same GPU, detects training bubbles, and adaptively fills them with online or offline inference workloads. Our results show that SpecInF effectively enhances GPU utilization under mainstream parallel training modes, delivering up to 14$\times$ the offline inference throughput of TGS and a 67\% reduction in online inference p95 latency compared with MPS, while guaranteeing the throughput of the collocated training.
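To make the idea of bubble filling concrete, the following is a minimal Python sketch, not the SpecInF implementation. It assumes a simplified timeline model in which training activity is given as sorted, non-overlapping busy intervals; idle gaps above a threshold are treated as bubbles, and each bubble admits a number of inference kernels (one token per kernel) proportional to its length. The names `Bubble`, `find_bubbles`, `grant_tokens`, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    """An idle interval on the GPU timeline (hypothetical model)."""
    start_ms: float
    end_ms: float

    @property
    def length_ms(self) -> float:
        return self.end_ms - self.start_ms


def find_bubbles(busy, horizon_ms, min_len_ms):
    """Return idle gaps of at least min_len_ms.

    `busy` is a list of (start_ms, end_ms) training intervals,
    assumed sorted by start time and non-overlapping.
    """
    bubbles = []
    cursor = 0.0  # end of the last busy interval seen so far
    for start, end in busy:
        if start - cursor >= min_len_ms:
            bubbles.append(Bubble(cursor, start))
        cursor = max(cursor, end)
    # Trailing idle time up to the end of the observation window.
    if horizon_ms - cursor >= min_len_ms:
        bubbles.append(Bubble(cursor, horizon_ms))
    return bubbles


def grant_tokens(bubble, kernel_cost_ms):
    """Number of inference kernels admitted into a bubble (one token each).

    Assumes every inference kernel takes roughly kernel_cost_ms.
    """
    return int(bubble.length_ms // kernel_cost_ms)


# Illustrative timeline: training is busy in three intervals within 30 ms.
training_busy = [(0.0, 4.0), (10.0, 12.0), (13.0, 20.0)]
bubbles = find_bubbles(training_busy, horizon_ms=30.0, min_len_ms=2.0)
tokens = [grant_tokens(b, kernel_cost_ms=1.5) for b in bubbles]

for b, t in zip(bubbles, tokens):
    print(f"bubble [{b.start_ms}, {b.end_ms}) ms -> {t} inference tokens")
```

In this toy model the 1 ms gap between the second and third training intervals is ignored because it falls below the threshold, which mirrors the intuition that very short bubbles are not worth filling. A real system would observe kernel activity online rather than from a precomputed timeline.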
