Large-scale Evaluation of Notebook Checkpointing with AI Agents
Hanxi Fang, Supawit Chockchowwat, Hari Sundaram, Yongjoo Park
TL;DR
The paper addresses the limited generalizability of small-scale human-subject evaluations of notebook checkpointing by introducing a large-scale evaluation framework based on AI agents. It uses ChatGPT-4o to simulate more than 1,000 branched data-exploration paths across 10 sessions, generating 2,848 executed code blocks on the Titanic dataset, and adds memory-intensive tests on Spotify data. The results show that Kishuboard’s code+data checkpointing improves exploration efficiency (up to 36% faster) and correctness while keeping notebooks cleaner and scaling to larger data tasks. This work demonstrates a scalable methodology for evaluating data-science tooling and highlights Kishuboard’s practical impact on reducing redundant work and managing state across branches.
Abstract
Saving, or checkpointing, intermediate results during interactive data exploration can potentially boost user productivity. However, existing studies on this topic are limited because they rely primarily on small-scale experiments with human participants, a fundamental constraint of human-subject studies. To address this limitation, we employ AI agents to simulate a large number of complex data exploration scenarios, including revisiting past states and branching into new exploration paths. This strategy enables us to accurately assess the impact of checkpointing while closely mimicking the behavior of real-world data practitioners. Our evaluation results, involving more than 1,000 exploration paths and 2,848 executed code blocks, show that a checkpointing framework for computational notebooks can indeed enhance productivity by minimizing unnecessary code re-executions and redundant variables or code.
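To make the idea of revisiting a past state and branching concrete, the following is a minimal sketch of code+data checkpointing in Python. It is not Kishuboard's implementation: the class `CheckpointedSession`, its methods, and the choice to snapshot variables with an in-memory `copy.deepcopy` are all assumptions made for illustration, and a real system would need to persist state and handle objects that cannot be copied.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Checkpoint:
    """One saved state: the cell that ran, a copy of the variables, and the parent."""
    cell_code: str
    namespace: Dict[str, Any]
    parent: Optional[int]


class CheckpointedSession:
    """Hypothetical toy notebook session that snapshots variables after every cell."""

    def __init__(self) -> None:
        self.checkpoints: List[Checkpoint] = []
        self.namespace: Dict[str, Any] = {}
        self.head: Optional[int] = None  # checkpoint the session currently sits on
        self.cells_executed = 0

    def _snapshot(self) -> Dict[str, Any]:
        # Copy user variables only; exec() re-adds __builtins__ when needed.
        return {k: copy.deepcopy(v)
                for k, v in self.namespace.items() if k != "__builtins__"}

    def run_cell(self, code: str) -> int:
        """Execute a cell, then checkpoint the resulting state as a child of head."""
        exec(code, self.namespace)
        self.cells_executed += 1
        self.checkpoints.append(Checkpoint(code, self._snapshot(), self.head))
        self.head = len(self.checkpoints) - 1
        return self.head

    def restore(self, checkpoint_id: int) -> None:
        """Return to a past state without re-executing any code."""
        self.namespace = copy.deepcopy(self.checkpoints[checkpoint_id].namespace)
        self.head = checkpoint_id

    def lineage(self, checkpoint_id: int) -> List[int]:
        """Checkpoint ids from the root of the exploration tree to the given one."""
        path: List[int] = []
        current: Optional[int] = checkpoint_id
        while current is not None:
            path.append(current)
            current = self.checkpoints[current].parent
        return path[::-1]


# Usage: explore one path, go back, and branch into another.
session = CheckpointedSession()
load = session.run_cell("data = [3, 1, 2]")
sort_branch = session.run_cell("data.sort()")          # mutates data in place
session.restore(load)                                  # undo the sort, no re-run
scale_branch = session.run_cell("data = [x * 10 for x in data]")
```

In this sketch the second branch starts from the restored `data` rather than from a re-executed loading cell, so three cells run in total; without checkpoints, reaching the same branch would require running the loading cell again. Each checkpoint records its parent, so the two branches form a tree rooted at the loading cell.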
