NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs

Benjamin Carver, Jingyuan Zhang, Haoliang Wang, Kanak Mahadik, Yue Cheng

TL;DR

The paper addresses the inefficiency of reserving GPUs for notebook sessions that sit idle most of the time during interactive ML work. It introduces NotebookOS, a replicated-kernel platform that uses Raft-based state replication and dynamic, on-demand GPU binding to provide immediate GPU access during active cell execution, while oversubscribing server resources to raise utilization. Through prototype experiments and simulation on production traces, NotebookOS demonstrates substantial GPU-hour savings (the abstract reports over 1,187 GPU hours saved in 17.5 hours of real-world IDLT) and improved interactivity compared with traditional reservation and batch approaches, along with lower cost and resilience. The work offers a practical path toward cost-effective, responsive notebook-based IDLT and a foundation for further GPU sharing and scale-out enhancements.

Abstract

Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high interarrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
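
To make the core idea in the abstract concrete, the following is a minimal sketch of on-demand GPU binding with oversubscription on a single server. It is not NotebookOS's implementation: the `GpuPool` class, its method names, and the `oversubscription` factor are hypothetical, chosen only for illustration, and the sketch omits Raft replication, executor election, replica migration, and cluster scaling.

```python
from contextlib import contextmanager


class GpuPool:
    """Toy model of one GPU server that binds GPUs only while a cell runs.

    Hypothetical illustration; names and policy are assumptions, not the
    paper's API.
    """

    def __init__(self, num_gpus: int, oversubscription: float = 2.0):
        self.num_gpus = num_gpus
        self.free_gpus = num_gpus
        # Idle kernels may collectively subscribe to more GPUs than exist.
        self.subscription_limit = int(num_gpus * oversubscription)
        self.subscribed = {}  # kernel id -> GPUs the kernel may request

    def register_kernel(self, kernel_id: str, gpus: int) -> bool:
        """Admit an idle kernel; this reserves no physical GPU."""
        if sum(self.subscribed.values()) + gpus > self.subscription_limit:
            return False
        self.subscribed[kernel_id] = gpus
        return True

    @contextmanager
    def execute_cell(self, kernel_id: str):
        """Bind GPUs for one cell execution and release them afterwards."""
        needed = self.subscribed[kernel_id]
        if needed > self.free_gpus:
            # A real system would try another replica or migrate instead.
            raise RuntimeError("no free GPUs on this server")
        self.free_gpus -= needed
        try:
            yield needed
        finally:
            self.free_gpus += needed


pool = GpuPool(num_gpus=4, oversubscription=2.0)
pool.register_kernel("kernel-a", 4)  # admitted, holds no GPU while idle
pool.register_kernel("kernel-b", 4)  # also admitted: 8 subscribed on 4 GPUs
with pool.execute_cell("kernel-a") as bound:
    pass  # GPUs are bound only inside this block
```

The design point the sketch tries to capture is that admission (registration) and allocation (binding) are separate steps: idle sessions cost nothing in GPU time, and contention arises only when several kernels on the same server execute cells simultaneously.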

Paper Structure

This paper contains 58 sections, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Client-server Jupyter Notebook architecture.
  • Figure 2: Workload characteristics of three representative GPU cluster traces.
  • Figure 3: Architecture overview of NotebookOS.
  • Figure 4: Process of creating a new kernel within NotebookOS's Docker Compose and Docker Swarm modes.
  • Figure 5: NotebookOS's executor election protocol.
  • ...and 15 more figures