Large-scale Evaluation of Notebook Checkpointing with AI Agents

Hanxi Fang, Supawit Chockchowwat, Hari Sundaram, Yongjoo Park

TL;DR

The paper addresses the limited generalizability of human-subject evaluations for notebook checkpointing by introducing a large-scale evaluation framework based on AI agents. It uses ChatGPT-4o to simulate 1,000 exploration branches across 10 sessions, generating 2,848 code blocks on the Titanic dataset, with additional memory-intensive tests on Spotify data. The results show that Kishuboard’s code+data checkpointing improves exploration efficiency (up to 36% faster in some cases) and correctness while keeping notebooks cleaner and scaling to larger data tasks. This work demonstrates a scalable methodology for evaluating data-science tooling and highlights Kishuboard’s practical impact on reducing redundant work and managing state across branches.

Abstract

Saving, or checkpointing, intermediate results during interactive data exploration can potentially boost user productivity. However, existing studies on this topic are limited, as they primarily rely on small-scale experiments with human participants, a fundamental constraint of human-subject studies. To address this limitation, we employ AI agents to simulate a large number of complex data exploration scenarios, including revisiting past states and branching into new exploration paths. This strategy enables us to accurately assess the impact of checkpointing while closely mimicking the behavior of real-world data practitioners. Our evaluation results, involving more than 1,000 exploration paths and 2,848 executed code blocks, show that a checkpointing framework for computational notebooks can indeed enhance productivity by minimizing unnecessary code re-executions and redundant variables or code.

Paper Structure

This paper contains 24 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Background about Kishuboard. The history graph (purple box) shows past commits. The code and variable panes (yellow box) display the information of a selected commit. From any past commit, users can load data only (i.e., execution rollback) or load both code and data (i.e., checkout) using the navigation popup (red box). The image is reproduced with the original authors' permission.
  • Figure 2: A toy example of a user's intended data exploration and the strategies to execute it. NaïveRestart repeatedly executes cells $c_1$ and $c_2$. NaïveContinue executes new cells ($c_{3'}$ and $c_{4'}$) without any kernel restart. NaïveContinue may lead to branch interference: for example, N/A values were already dropped by $c_3$, making data imputation in $c_{3'}$ ineffective. Kishuboard restores checkpointed data to explore a new path, thus removing the repetitive work and preventing potential branch interference.
  • Figure 3: End-to-end execution time for Kishuboard and baseline methods. We generated 1,000 branches of code using LLM-Agent, divided into 10 exploration sessions with 100 branches each. The sessions are sorted in ascending order by NaïveRestart time. The NaïveContinue method is the fastest in terms of execution time, as it only runs newly added cells without checkpoint or checkout overhead. However, it is faulty: it often produces incorrect results that do not trigger explicit errors, which may require significant debugging time. The additional time for the Kishuboard group is due entirely to checkpointing and checkout overhead, with the worst-case average overhead being just 2 seconds per branch. The red annotations indicate the number of implicitly incorrect results for each session.
  • Figure 4: The peak number of kernel variables for each session during exploration. A smaller number of variables may be preferred for easier understanding. NaïveContinue produces excessive variables, increasing the cognitive load of keeping track of variables across branches. Kishuboard has exactly two more variables than NaïveRestart, which hold user-invisible metadata.
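
The three strategies compared in Figure 2 can be illustrated with a minimal sketch. This is not Kishuboard's implementation or API: the toy rows, the cell functions (`c1`, `c2`, `c3`, `c3_alt`), and the use of `copy.deepcopy` as a stand-in for checkpointing are all assumptions made for illustration. The sketch counts how many cells each strategy executes and shows how continuing in the same kernel lets the drop step of one branch silently defeat the imputation step of the other.

```python
import copy

# Hypothetical toy data; None stands in for an N/A value.
ROWS = [{"age": 22.0}, {"age": None}, {"age": 38.0}]

def c1(_state):
    """Load the data (deep-copied so separate runs never share rows)."""
    return copy.deepcopy(ROWS)

def c2(state):
    """Shared preparation step: tag each row with its index."""
    return [dict(row, idx=i) for i, row in enumerate(state)]

def c3(state):
    """Branch A: drop rows whose age is missing."""
    return [row for row in state if row["age"] is not None]

def c3_alt(state):
    """Branch B: impute missing ages with the mean of the known ages."""
    known = [row["age"] for row in state if row["age"] is not None]
    mean = sum(known) / len(known)
    return [dict(row, age=mean if row["age"] is None else row["age"])
            for row in state]

def run(cells, state=None):
    """Execute cells in order; return the final state and the cell count."""
    for cell in cells:
        state = cell(state)
    return state, len(cells)

def naive_restart():
    """Restart the kernel for branch B, re-running the shared prefix."""
    _, n_a = run([c1, c2, c3])
    result, n_b = run([c1, c2, c3_alt])
    return result, n_a + n_b

def naive_continue():
    """Run branch B on top of branch A's state, with no restart."""
    state, n_a = run([c1, c2, c3])
    result, n_b = run([c3_alt], state)
    return result, n_a + n_b

def checkpoint_restore():
    """Snapshot the state after the shared prefix and restore it for branch B.
    deepcopy is only a stand-in for a real checkpoint mechanism."""
    state, n_prefix = run([c1, c2])
    snapshot = copy.deepcopy(state)
    _, n_a = run([c3], state)
    result, n_b = run([c3_alt], copy.deepcopy(snapshot))
    return result, n_prefix + n_a + n_b

for strategy in (naive_restart, naive_continue, checkpoint_restore):
    rows, executed = strategy()
    print(strategy.__name__, "cells executed:", executed,
          "ages:", [row["age"] for row in rows])
```

In this toy setting, restarting yields the intended branch B result at the cost of re-running the shared cells, continuing executes fewer cells but returns only the two rows that survived the earlier drop, and restoring the snapshot matches the restart result with the lower cell count. Real checkpointing also incurs save and restore overhead, which this sketch does not model.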