Themisto: Jupyter-Based Runtime Benchmark

Konstantin Grotov, Sergey Titov

TL;DR

Themisto introduces a runtime-aware benchmark using Jupyter notebook development trajectories to evaluate how well large language models can utilize dynamic runtime context for code tasks. It defines two tasks—next cell prediction and output prediction—and evaluates several foundation models with and without runtime information, using metrics such as exact match, ROUGE-L, and ChrF. The study finds that current models struggle on these tasks and that runtime context provides limited, variable benefits, suggesting the need for curated, task-specific context and further fine-tuning. By releasing the JuNE-derived trajectories and benchmark data on Zenodo, the work lays groundwork for broader exploration of dynamic, runtime-aware code generation and interactive development assistance.

Abstract

In this work, we present a benchmark that consists of Jupyter notebook development trajectories and allows measuring how well large language models (LLMs) can leverage runtime information for code output prediction and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating the runtime context is a significantly understudied domain in the development of code-based models.
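To make two of the metrics named in the TL;DR concrete, the following is a minimal sketch of exact match and an LCS-based ROUGE-L F1 score. It is an illustration only, not the authors' evaluation code: the function names, the whitespace tokenization, and the stripping of surrounding whitespace before comparison are all assumptions. ChrF is not covered here.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """True if the strings match after trimming surrounding whitespace (assumed normalization)."""
    return prediction.strip() == reference.strip()


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical predicted vs. reference cell output.
    print(exact_match(" 42\n", "42"))
    print(rouge_l_f1("a b c d", "a c d e"))
```

Exact match rewards only fully correct predictions, while ROUGE-L gives partial credit for tokens that appear in the right order, which is why the two can diverge sharply on long cell outputs.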

Paper Structure

This paper contains 12 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: A sample of code-output trajectory pairs for the output prediction task (left) and the next cell prediction task (right). The gray and white rows represent the content of the trajectory, including the cell content and the cell output, while green marks the entity we aim to predict.
  • Figure 2: Diversity metrics comparison between output and code.