FreeRide: Harvesting Bubbles in Pipeline Parallelism

Jiashu Zhang, Zihan Pan, Molly Xu, Khuzaima Daudjee, Sihang Liu

TL;DR

The paper tackles GPU underutilization caused by bubbles in pipeline parallelism during LLM training. It introduces FreeRide, a middleware that exposes two side-task interfaces (iterative and imperative) and uses profiling-guided management to schedule generic GPU workloads into bubbles with minimal disruption to the main pipeline. Key contributions include a state-machine-based side-task framework, limits on side tasks' GPU memory and execution time enforced through CUDA MPS and runtime controls, and an evaluation across model training, graph analytics, and image processing side tasks showing 7.8% average cost savings with about 1% overhead. This approach makes it practical to co-locate diverse workloads with expensive LLM training, improving GPU utilization and reducing training costs without heavy customization of the training framework.
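To make the state-machine idea more concrete, here is a minimal Python sketch of an iterative side task that stays paused between bubbles and, inside each bubble, runs only as many short steps as fit. It is not FreeRide's API: the class `IterativeSideTask`, its methods, the state names, and the per-step estimate `step_time_ms` are all hypothetical, and the fixed profiled step time stands in for the runtime execution-time limits described above.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; FreeRide's own state machine may differ.
    SUBMITTED = auto()
    CREATED = auto()
    RUNNING = auto()
    PAUSED = auto()
    STOPPED = auto()


class IterativeSideTask:
    """Hypothetical side task whose work is split into short, profiled steps."""

    def __init__(self, total_steps: int, step_time_ms: float):
        self.state = State.SUBMITTED
        self.total_steps = total_steps
        self.step_time_ms = step_time_ms  # assumed profiled cost of one step
        self.done_steps = 0

    def create(self) -> None:
        # One-time setup (e.g., loading data), done before any bubble is used.
        assert self.state is State.SUBMITTED
        self.state = State.CREATED

    def run_in_bubble(self, bubble_ms: float) -> int:
        """Run whole steps while they fit in the bubble; return steps run."""
        assert self.state in (State.CREATED, State.PAUSED)
        self.state = State.RUNNING
        remaining = bubble_ms
        steps = 0
        # Check the budget before each step so no step overruns the bubble.
        while remaining >= self.step_time_ms and self.done_steps < self.total_steps:
            self.run_step()
            remaining -= self.step_time_ms
            steps += 1
        # Finished tasks stop; unfinished ones pause until the next bubble.
        if self.done_steps == self.total_steps:
            self.state = State.STOPPED
        else:
            self.state = State.PAUSED
        return steps

    def run_step(self) -> None:
        # Placeholder for one unit of GPU work (e.g., one mini-batch).
        self.done_steps += 1


task = IterativeSideTask(total_steps=10, step_time_ms=3.0)
task.create()
results = [task.run_in_bubble(b) for b in [10.0, 2.0, 7.5, 20.0]]
print(results, task.state)
```

Checking the time budget before each step, rather than interrupting a step midway, means a step never overruns a bubble as long as the profiled estimate holds; a bubble shorter than one step (the 2.0 ms one here) is simply skipped.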

Abstract

The occurrence of bubbles in pipeline parallelism is an inherent limitation that can account for more than 40% of the large language model (LLM) training time and is one of the main reasons for the underutilization of GPU resources in LLM training. Harvesting these bubbles for GPU side tasks can increase resource utilization and reduce training costs but comes with challenges. First, because bubbles are discontinuous and vary in shape, programming side tasks is difficult and requires excessive engineering effort. Second, a side task can compete with pipeline training for GPU resources and incur significant overhead. To address these challenges, we propose FreeRide, a system designed to harvest bubbles in pipeline parallelism for side tasks. FreeRide provides programmers with interfaces to implement side tasks easily, manages bubbles and side tasks during pipeline training, and controls side tasks' access to GPU resources to reduce overhead. We demonstrate that FreeRide achieves 7.8% average cost savings with a negligible overhead of about 1% in training LLMs while serving model training, graph analytics, and image processing side tasks.
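As a rough illustration of why bubbles can take such a large share of training time, the standard idealized model for GPipe- or 1F1B-style schedules (not taken from this paper) gives a bubble fraction of (p - 1)/(m + p - 1) for p pipeline stages and m micro-batches per iteration, assuming uniform stage times and ignoring communication. The sketch below evaluates it for a few illustrative configurations.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of an idealized GPipe/1F1B schedule with uniform stages.

    Each stage idles for (p - 1) micro-batch slots (forward plus backward)
    while the pipeline fills and drains, out of (m + p - 1) slots in total.
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (micro_batches + stages - 1)


for p, m in [(4, 16), (8, 8), (8, 32)]:
    print(f"p={p} m={m} bubble={bubble_fraction(p, m):.1%}")
# p=4 m=16 bubble=15.8%
# p=8 m=8 bubble=46.7%
# p=8 m=32 bubble=17.9%
```

With as many stages as micro-batches (p = m = 8), this model already predicts idle time above 40%, the same order of magnitude as the figure quoted above; measured bubbles also depend on unequal stage costs and communication, which the formula ignores.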

Paper Structure (34 sections, 5 equations, 9 figures, 2 tables, 2 algorithms)

Figures (9)

  • Figure 1: A pipeline training epoch in DeepSpeed.
  • Figure 2: Statistics of bubbles under different model sizes.
  • Figure 3: Workflow of FreeRide.
  • Figure 4: State transitions in a side task program.
  • Figure 5: Architecture of FreeRide.
  • ...and 4 more figures