Efficiently Reproducing Distributed Workflows in Notebook-based Systems

Talha Azaz, Raza Ahmad, Md Saiful Islam, Douglas Thain, Tanu Malik

Abstract

Notebooks provide an author-friendly environment for iterative development, modular execution, and easy sharing. Distributed workflows are increasingly being authored and executed in notebooks, yet sharing and reproducing them remains challenging. Even small code or parameter changes often force full end-to-end re-execution of the distributed workflow, limiting iterative development for such workloads. Current methods for improving notebook execution operate on single-node workflows, while optimization techniques for distributed workflows typically sacrifice reproducibility. We introduce NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows in notebooks. NBRewind consists of two kernels: audit and repeat. The audit kernel performs incremental, cell-level checkpointing to avoid unnecessary re-runs; the repeat kernel reconstructs checkpoints and enables partial re-execution, including of notebook cells that manage the distributed workflow. Both kernels rely on data-flow analysis across cells. We show how checkpoints and logs, when packaged as part of a standardized notebook specification, improve sharing and reproducibility. Using real-world case studies, we show that creating incremental checkpoints adds minimal overhead and enables portable, cross-site reproducibility of notebook-based distributed workflows on HPC systems.
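To make the idea of data-flow analysis across cells concrete, the following is a minimal sketch, not NBRewind's implementation. It assumes each cell is plain Python source and uses the standard `ast` module to approximate which names a cell reads and binds; the helper names `cell_reads_writes` and `cells_to_rerun` and the example notebook are hypothetical. Given an edited cell, it reports which cells would need re-execution, with all others presumed restorable from a checkpoint. It is deliberately conservative and ignores in-place mutation through method calls, attribute access, and side effects on files or remote workers.

```python
import ast


def cell_reads_writes(source):
    """Approximate (reads, writes): names a cell loads and names it binds."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:
                writes.add(node.id)
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            # `x += 1` both reads and rebinds x.
            reads.add(node.target.id)
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            writes.add(node.name)
        if isinstance(node, ast.alias):
            # `import a.b` binds `a`; `import a as b` binds `b`.
            writes.add((node.asname or node.name).split(".")[0])
    return reads, writes


def cells_to_rerun(cells, changed):
    """Indices of cells that must re-execute after cell `changed` is edited."""
    stale = set()   # names whose checkpointed values may be out of date
    rerun = []
    for i, src in enumerate(cells):
        if i < changed:
            continue  # earlier cells are unaffected by the edit
        reads, writes = cell_reads_writes(src)
        if i == changed or reads & stale:
            rerun.append(i)
            stale |= writes  # conservative: stale names never become fresh
    return rerun


# Hypothetical five-cell notebook.
cells = [
    "import statistics\ndata = [1, 2, 3, 4]",
    "scale = 2",
    "scaled = [x * scale for x in data]",
    "mean = statistics.mean(data)",
    "report = (mean, max(scaled))",
]

print(cells_to_rerun(cells, 1))  # prints [1, 2, 4]
```

In this example, editing the `scale` cell leaves the `mean` cell untouched because it never reads a name derived from `scale`, so its result could be restored rather than recomputed.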

Paper Structure

This paper contains 15 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Example of a distributed workflow in a notebook with a manager creating worker threads to compute summary statistics.
  • Figure 2: The notebook environment, comprising a web-based notebook client that runs the notebook code and interacts with the kernel process in the notebook server.
  • Figure 3: The distributed notebook workflow: the kernel launches the manager, which spawns several worker threads that run in parallel on the worker node.
  • Figure 4: Architecture of NBRewind. The audit kernel on the host cluster executes a notebook and creates its corresponding notebook container. The notebook and its container are shared with a collaborator to be executed in a target cluster using the repeat kernel. Large volumes of data in a notebook container may be fetched dynamically from a remote data store.
  • Figure 5: The internal states maintained during the NBRewind workflow to create checkpoints, as illustrated by the first two cells of our example notebook.
  • ...and 4 more figures