ElasticNotebook: Enabling Live Migration for Computational Notebooks

Zhaoheng Li, Pranav Gor, Rahul Prabhu, Hui Yu, Yuzhou Mao, Yongjoo Park

TL;DR

ElasticNotebook tackles the problem of losing notebook state during live migration by introducing a transparent data layer and the Application History Graph (AHG) to model variable and cell dependencies. It couples on-the-fly program analyses with a graph-based optimization that reduces state replication to a min-cut problem, balancing variable copying and recomputation to minimize migration and restoration times while preserving isomorphism of references. The approach achieves 85-98% faster migration and 94-99% faster restoration with negligible runtime/memory overhead and up to 66% smaller checkpoint sizes compared to baselines. Across diverse workloads and architectures, ElasticNotebook demonstrates robustness, portability, and scalability, offering practical benefits for elastic computing, on-demand scaling, and seamless user experience in data science workflows.

Abstract

Computational notebooks (e.g., Jupyter, Google Colab) are widely used for interactive data science and machine learning. In these frameworks, users can start a session, then execute cells (i.e., a set of statements) to create variables, train models, visualize results, etc. Unfortunately, existing notebook systems do not offer live migration: when a notebook launches on a new machine, it loses its state, preventing users from continuing their tasks from where they had left off. This is because, unlike a DBMS, the sessions directly rely on underlying kernels (e.g., Python/R interpreters) without an additional data management layer. Existing techniques for preserving states, such as copying all variables or OS-level checkpointing, are unreliable (often fail), inefficient, and platform-dependent. Also, re-running code from scratch can be highly time-consuming. In this paper, we introduce a new notebook system, ElasticNotebook, that offers live migration via checkpointing/restoration using a novel mechanism that is reliable, efficient, and platform-independent. Specifically, by observing all cell executions via transparent, lightweight monitoring, ElasticNotebook can find a reliable and efficient way (i.e., replication plan) for reconstructing the original session state, considering variable-cell dependencies, observed runtime, variable sizes, etc. To this end, our new graph-based optimization problem finds how to reconstruct all variables (efficiently) from a subset of variables that can be transferred across machines. We show that ElasticNotebook reduces end-to-end migration and restoration times by 85%-98% and 94%-99%, respectively, on a variety of notebooks (i.e., Kaggle, JWST, and Tutorial) with negligible runtime and memory overheads of <2.5% and <10%, respectively.
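
The copy-versus-recompute trade-off behind a replication plan can be illustrated with a minimal sketch. This is not ElasticNotebook's implementation: the paper solves the problem via a min-cut reduction, whereas the toy below simply enumerates every possible set of copied variables. The cell names, runtimes, and transfer costs are invented for illustration, and the sketch assumes each variable is produced by exactly one cell (no overwrites or in-place modifications).

```python
from itertools import combinations

# Hypothetical toy session: cell -> (input variables, output variables, runtime in seconds)
CELLS = {
    "c1": ((), ("df",), 60.0),              # load data
    "c2": (("df",), ("model",), 300.0),     # train a model
    "c3": (("df", "model"), ("plot",), 2.0),  # build a plot object
}
# Assumed transfer time per variable in seconds; None marks a variable that cannot be copied
COPY_COST = {"df": 90.0, "model": 5.0, "plot": None}

# Map each variable to the single cell that produces it (toy assumption)
PRODUCER = {var: cell for cell, (_, outs, _) in CELLS.items() for var in outs}


def cells_to_rerun(copied):
    """Return the cells that must be rerun so that every non-copied variable is rebuilt."""
    rerun = set()
    stack = [var for var in COPY_COST if var not in copied]
    while stack:
        cell = PRODUCER[stack.pop()]
        if cell in rerun:
            continue
        rerun.add(cell)
        # A rerun cell needs its inputs, which are either copied or recomputed in turn
        stack.extend(var for var in CELLS[cell][0] if var not in copied)
    return rerun


def plan_cost(copied):
    """Total cost of a plan: transfer time of copied variables plus rerun time of cells."""
    rerun = cells_to_rerun(copied)
    return sum(COPY_COST[v] for v in copied) + sum(CELLS[c][2] for c in rerun)


def best_plan():
    """Brute-force search over all sets of copyable variables (exponential; toy sizes only)."""
    copyable = [v for v, cost in COPY_COST.items() if cost is not None]
    candidates = (
        frozenset(subset)
        for r in range(len(copyable) + 1)
        for subset in combinations(copyable, r)
    )
    best = min(candidates, key=plan_cost)
    return best, cells_to_rerun(best), plan_cost(best)
```

With these made-up numbers, `best_plan()` copies only `model` and reruns `c1` and `c3`, for a total of 67 seconds: cheaper than copying everything (97 s) or rerunning everything (362 s). Enumeration is exponential in the number of variables, which is why a polynomial-time formulation such as the paper's min-cut reduction matters in practice.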

Paper Structure (91 sections, 1 theorem, 4 equations, 17 figures, 5 tables)

Key Result

Theorem 1

Given the approximate AHG $\mathcal{G}$ of ElasticNotebook with false positives, and the true AHG $\mathcal{G}^*$, there is $req^*(x, t^*) \subseteq req(x, t)$ for any variable $x \in \mathcal{X}$, where $(x, t)$ and $(x, t^*)$, $req$ and $req^*$ are the active VSs of $x$ and reconstruction mapping

Figures (17)

  • Figure 1: Our transparent data layer (in the middle) enables robust, efficient, and platform-independent live migration.
  • Figure 2: For every cell run, we can inject custom pre-/post-processing logic. "%%intercept" is hidden to users.
  • Figure 3: Example app history (top) and different replication plan costs (bottom). Combining recompute/copy allows faster migration (Fast-migrate). Alternatively, the optimal plan changes if the restoration is prioritized (Fast-restore).
  • Figure 4: ElasticNotebook architecture. Its data layer acts as a gateway between the user interface and the kernel: cell executions are intercepted to observe session state changes.
  • Figure 5: An example notebook and its corresponding Application History Graph. The AHG tells ElasticNotebook how to recompute variables; for example, rerunning $c_{t_1}$ and $c_{t_3}$ is necessary for recomputing x (red).
  • ...and 12 more figures
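
The interception idea in Figures 2, 4, and 5 can be sketched in plain Python as a wrapper that runs each cell and records which variables it read and which it (re)bound. This is a simplified illustration under stated assumptions, not the paper's mechanism: `run_cell` and `names_read` are hypothetical helpers, reads are approximated from the cell's syntax tree, and writes are detected only as rebindings (in-place mutations such as `x.append(1)` go unnoticed here).

```python
import ast


def names_read(source):
    """Names that the cell's code loads, according to its syntax tree."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }


def snapshot(namespace):
    """Map each user-visible variable name to the identity of its current object."""
    return {name: id(obj) for name, obj in namespace.items() if not name.startswith("__")}


def run_cell(source, namespace, history):
    """Execute one cell, then append a record of the variables it read and wrote."""
    before = snapshot(namespace)   # pre-processing step
    exec(source, namespace)        # the actual cell execution
    after = snapshot(namespace)    # post-processing step

    record = {
        "source": source,
        # Variables that existed before the cell ran and are referenced by it
        "reads": sorted(names_read(source) & before.keys()),
        # Variables that are new or now bound to a different object
        "writes": sorted(n for n, ident in after.items() if before.get(n) != ident),
    }
    history.append(record)
    return record
```

Running three cells through the wrapper yields a per-cell read/write log, which is the kind of information a dependency graph between cells and variables can be built from:

```python
ns, history = {}, []
run_cell("x = [1, 2, 3]", ns, history)
run_cell("y = sum(x)", ns, history)
run_cell("x = x + [4]", ns, history)
```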

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Theorem 1
  • Definition 8
  • Definition 9