Themisto: Jupyter-Based Runtime Benchmark
Konstantin Grotov, Sergey Titov
TL;DR
Themisto introduces a runtime-aware benchmark built from Jupyter notebook development trajectories to evaluate how well large language models can use dynamic runtime context for code tasks. It defines two tasks, next cell prediction and output prediction, and evaluates several foundation models with and without runtime information, using exact match, ROUGE-L, and ChrF as metrics. The study finds that current models struggle on these tasks and that runtime context provides limited, variable benefits, suggesting the need for curated, task-specific context and further fine-tuning. By releasing the JuNE-derived trajectories and benchmark data on Zenodo, the work lays groundwork for broader exploration of dynamic, runtime-aware code generation and interactive development assistance.
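To make two of the metrics named above concrete, here is a minimal sketch of exact match and ROUGE-L for a single prediction/reference pair. It is an illustration under stated assumptions, not the benchmark's evaluation code: whitespace tokenization, stripping of surrounding whitespace before exact match, and the balanced F-measure form of ROUGE-L are all assumptions, and the function names are hypothetical. ChrF is omitted for brevity.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after stripping outer whitespace (assumed normalization)."""
    return float(prediction.strip() == reference.strip())


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as a balanced F-measure over whitespace tokens (assumed tokenization)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical predicted vs. reference notebook cell.
    predicted = "import pandas as pd"
    expected = "import numpy as np"
    print(exact_match(predicted, expected))
    print(rouge_l_f1(predicted, expected))
```

In this toy pair the longest common subsequence is `import as` (2 of 4 tokens on each side), so precision and recall are both 0.5. Comparing such scores for the same model with and without runtime information in the prompt is the kind of contrast the benchmark reports.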
Abstract
In this work, we present a benchmark that consists of Jupyter notebook development trajectories and allows measuring how well large language models (LLMs) can leverage runtime information for code output prediction and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating runtime context is a significantly understudied domain in the development of code-based models.
