Dias: Dynamic Rewriting of Pandas Code

Stefanos Baziotis, Daniel Kang, Charith Mendis

TL;DR

Dias presents a novel external, source-to-source dynamic rewriter that accelerates pandas-based, ad-hoc EDA workloads on a single machine by rewriting notebook cells across Python and pandas. It combines a fast pattern matcher with a runtime-precondition-aware rewriter to ensure semantic equivalence, enabling cross-library rewrites and just-in-time adaptations in IPython environments. Empirical results on real Kaggle notebooks show per-cell speedups up to 57$\times$ and notebook speedups up to 3.6$\times$ (and up to 27.1$\times$ vs modin) with minimal overhead and no extra memory usage. Dias demonstrates the practicality of automatic, dynamic rewrites for interactive data analysis, offering significant performance improvements without requiring users to abandon familiar pandas APIs or learn new libraries.

Abstract

In recent years, dataframe libraries such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse, including custom functions which can span libraries or be written in pure Python. The majority of systems available to accelerate EDA workloads focus on bulk-parallel workloads, which contain vastly different computational patterns, typically within a single library. As a result, they can introduce excessive overheads for ad-hoc EDA workloads due to their expensive optimization techniques. Instead, we identify program rewriting as a lightweight technique which can offer substantial speedups while also avoiding slowdowns. We implemented our techniques in Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including dynamic checking of preconditions under which rewrites are correct and just-in-time rewrites for notebook environments. We show that Dias can rewrite individual cells to be 57$\times$ faster compared to pandas and 1909$\times$ faster compared to optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6$\times$ compared to pandas and 26.4$\times$ compared to modin.
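
To make the idea of a rewrite guarded by a dynamic precondition check concrete, here is a minimal sketch in plain pandas. It is not Dias's implementation: the function names (`original_cell`, `rewritten_cell`, `precondition_holds`, `run_cell`), the column names `a` and `b`, and the specific precondition are all hypothetical, chosen only to illustrate the pattern of running a faster version when a run-time check passes and the original code otherwise.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def original_cell(df):
    # What the user wrote: a Python-level function call per row.
    return df.apply(lambda row: row["a"] + row["b"], axis=1)


def rewritten_cell(df):
    # Candidate rewrite: one vectorized addition over whole columns.
    return df["a"] + df["b"]


def precondition_holds(df):
    # Hypothetical precondition, checked at run time because the type
    # and contents of `df` are not known from the source text alone.
    return (
        isinstance(df, pd.DataFrame)
        and df.columns.is_unique
        and {"a", "b"}.issubset(df.columns)
        and is_numeric_dtype(df["a"])
        and is_numeric_dtype(df["b"])
    )


def run_cell(df):
    # Use the fast version only when the check passes;
    # otherwise run the original code unchanged.
    if precondition_holds(df):
        return rewritten_cell(df)
    return original_cell(df)


numeric = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})
text = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"]})

fast_result = run_cell(numeric)   # takes the rewritten path
fallback_result = run_cell(text)  # check fails, original code runs
```

The check here is deliberately conservative: when it fails, the user's code runs exactly as written, so in this sketch a missed optimization costs only the time spent on the check.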

Paper Structure (46 sections, 25 figures, 2 tables)

Figures (25)

  • Figure 1: Loop which accesses individual elements (source: Kaggle real_nb_for_loop). This loop can be hundreds of times slower in bulk-parallel frameworks such as modin and dask, as well as in PolaRS, which are not optimized for individual accesses.
  • Figure 2: A rewrite example where we avoid apply(). The rewritten version, which uses vectorized, native execution, can run up to 1000$\times$ faster.
  • Figure 3: Rewrite example that crosses library boundaries, and thus cannot be performed by previous techniques. The rewritten version can be up to 11$\times$ faster.
  • Figure 4: Splitting in pandas and Python. Surprisingly, the pure Python implementation is up to 7$\times$ faster.
  • Figure 5: Dias overview. Dias identifies patterns in the source code, which it rewrites using its rewriter. The optimized version is used only if certain dynamic checks pass, to ensure correctness.
  • ...and 20 more figures
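
The kind of comparison described in the Figure 4 caption (splitting in pandas versus pure Python) can be sketched as follows. This is an illustrative reconstruction, not the code from the figure: the helper names, the sample data, and the separator are assumptions. The pure Python version also assumes every element is a string; a missing value such as `NaN` would raise an error there while `str.split` would pass it through, which is the sort of condition a dynamic check would need to rule out.

```python
import pandas as pd


def split_with_pandas(s, sep):
    # pandas string accessor: splits every element of the Series.
    return s.str.split(pat=sep)


def split_with_python(s, sep):
    # Pure Python list comprehension, wrapped back into a Series
    # with the same index and name. Assumes all elements are str.
    return pd.Series([x.split(sep) for x in s], index=s.index, name=s.name)


# Hypothetical sample data for illustration only.
names = pd.Series(["Braund, Mr. Owen", "Heikkinen, Miss. Laina"], name="Name")

by_pandas = split_with_pandas(names, " ")
by_python = split_with_python(names, " ")
```

Both versions produce the same lists for this input. Any speed difference depends on the data and the pandas version, so it is best measured locally (for example with `timeit`) rather than assumed.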