When Are Reactive Notebooks Not Reactive?

Megan Zheng, Will Crichton, Akshay Narayan, Deepti Raghavan, Nikos Vasilakis

TL;DR

Computational notebooks lack consistent reactivity when edits occur during execution. The authors introduce Rex, a fine-grained micro-benchmark with formal definitions of execution order, consistency, and soundness to evaluate reactive notebook systems. Evaluating Rex on Marimo, Observable, and IPyflow (plus baselines) reveals that direct assignments are reliably handled, but reassignment, mutations, and external-state interactions frequently cause undefined or incorrect reactivity, with IPyflow performing best among live systems but still missing edge cases. The work argues for clearer guarantees and tooling, and positions Rex as a practical aid for researchers and developers to improve reactive notebook implementations and user understanding.

Abstract

Computational notebooks are convenient for programmers, but can easily become confusing and inconsistent because a program can be edited incrementally while it is running. Recent reactive notebook systems, such as IPyflow, Marimo, and Observable, strive to keep notebook state in sync with the current cell code by re-executing a minimal set of cells upon modification. However, each system defines reactivity differently. Additionally, under any of these definitions, we find simple notebook modifications that can break each system. Overall, these inconsistencies make it difficult for users to construct a mental model of their reactive notebook's implementation. This paper proposes Rex, a fine-grained test suite to discuss and assess reactivity capabilities within reactive notebook systems. We evaluate Rex on three existing reactive notebook systems and classify their failures with the aims of (i) helping programmers understand when reactivity fails and (ii) helping notebook implementations improve.
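To make "re-executing a minimal set of cells upon modification" concrete, the following is a minimal sketch of one possible reactivity model, not the algorithm used by IPyflow, Marimo, or Observable. It assumes cells run top to bottom and that dependencies can be read off plain variable names; the helpers `defs_and_uses` and `cells_to_rerun` are hypothetical names introduced only for this illustration.

```python
import ast


def defs_and_uses(source):
    """Return (names assigned, names read) anywhere in a cell's source."""
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                uses.add(node.id)
    return defs, uses


def cells_to_rerun(cells, edited):
    """Indices to re-execute after editing cells[edited].

    Assumption: top-to-bottom execution order. A later cell is rerun
    when it reads a name that an already-rerun cell defines.
    """
    stale = set(defs_and_uses(cells[edited])[0])
    rerun = [edited]
    for i in range(edited + 1, len(cells)):
        defs, uses = defs_and_uses(cells[i])
        if uses & stale:
            rerun.append(i)
            stale |= defs  # names this cell defines are now stale too
    return rerun


cells = ["x = 1", "y = x + 1", "z = 10", "print(y + z)"]
print(cells_to_rerun(cells, 0))  # [0, 1, 3]
```

Cell 2 is skipped because it never reads `x` or anything derived from it. A name-based model like this handles direct assignments, which is the category the TL;DR reports as reliably handled; it says nothing about mutations or external state.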

Paper Structure

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Notebook's Current Reactivity System: Current notebook implementations that offer reactivity do not react in a consistent, predictable manner, which can confuse users who depend on reactivity to manage complex notebook state.
  • Figure 2: aliasing_val_swap Notebook Benchmark: A benchmark contains an original notebook and a modified notebook. The modification in this example is highlighted in yellow.
  • Figure 3: Correctness Results By Modification: Overview of benchmark counts, broken down by whether systems correctly react to in-scope benchmarks (in-scope match vs. in-scope mismatch) or detect out-of-scope features and alert users before execution, reducing undefined behavior from reactivity (out-of-scope caught vs. out-of-scope not caught). For Observable, 51 benchmarks were marked NA as incomparable because of execution order differences, file system restrictions, or untranslatable benchmarks containing Python libraries.
  • Figure 4: Collection And Function Mutations: Benchmarks that mutate collections, or that mutate state through functions, make it challenging for reactive systems to balance consistency with efficiency. Cell 2 in list_subscript_redef_2 and cell 1 in func_list_append above contain the modification, depicted with the original code crossed out and replaced by the user modification.
  • Figure 5: In-Scope Matching Rerun Ratios: Average rerun ratios of each reactive system on all in-scope, matching benchmarks across modification categories.
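The difficulty with collection and function mutations noted for Figure 4 can be illustrated with a short sketch. It assumes a hypothetical reactive system that tracks only names bound by assignment; `assigned_names` is a helper invented for this example, and the snippets are illustrative rather than the actual contents of the Rex benchmarks.

```python
import ast


def assigned_names(source):
    """Names bound by assignment in a cell (Name nodes in Store context)."""
    return {
        node.id
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }


# A direct assignment is visible to name-based tracking.
print(assigned_names("xs = [1, 2]"))

# In-place mutations change xs without rebinding the name, so a purely
# name-based analysis reports that these cells define nothing.
print(assigned_names("xs.append(3)"))
print(assigned_names("xs[0] = 9"))

# Mutation through a function hides the change behind a call.
print(assigned_names("def add(lst):\n    lst.append(3)\nadd(xs)"))
```

Only the first cell reports `xs` as defined. A system relying on this signal alone would not rerun cells that depend on `xs` after the other three edits, which is one plausible way the consistency and efficiency trade-off arises.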