RuleFlow: Generating Reusable Program Optimizations with LLMs

Avaljot Singh, Dushyant Bharadwaj, Stefanos Baziotis, Kaushik Varadharajan, Charith Mendis

TL;DR

RuleFlow addresses the challenge of optimizing Pandas programs by combining offline, LLM-driven discovery with a deterministic, rule-based deployment pipeline. It introduces three stages: SnippetGen discovers candidate optimizations, RuleGen converts them into reusable rewrite rules, and CodeGen applies those rules to unseen code via static pattern matching. The approach yields state-of-the-art end-to-end speedups on PandasBench, surpassing prior compiler-based (Dias) and systems-based (Modin) baselines, and produces a bank of high-quality rewrite rules that generalize across notebooks. By decoupling discovery from deployment, RuleFlow delivers scalable performance improvements while avoiding the high cost and unreliability of per-program LLM optimization, offering a practical path to broader adoption of automated Pandas optimizations.

Abstract

Optimizing Pandas programs is a challenging problem. Existing systems and compiler-based approaches offer reliability but are either heavyweight or support only a limited set of optimizations. Conversely, using LLMs in a per-program optimization methodology can synthesize nontrivial optimizations, but is unreliable, expensive, and offers a low yield. In this work, we introduce a hybrid, three-stage approach that decouples discovery from deployment and connects them via a novel bridge. First, it discovers per-program optimizations (discovery). Second, it converts these optimizations into generalized rewrite rules (bridge). Finally, it incorporates these rules into a compiler that applies them automatically wherever applicable, eliminating repeated reliance on LLMs (deployment). We demonstrate that RuleFlow is the new state-of-the-art (SOTA) Pandas optimization framework on PandasBench, a challenging Pandas benchmark consisting of Python notebooks. Across these notebooks, we achieve a speedup of up to 4.3x over Dias, the previous compiler-based SOTA, and 1914.9x over Modin, the previous systems-based SOTA. Our code is available at https://github.com/ADAPT-uiuc/RuleFlow.

Paper Structure (61 sections, 1 equation, 11 figures, 2 tables)

This paper contains 61 sections, 1 equation, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Example optimization proposed by LLM. The optimized code is $1770\times$ faster on a representative dataframe size.
  • Figure 2: This rule optimizes single column deletion by replacing drop with pop. This rewrite achieves a mean speedup of $18.31\times$ with a maximum speedup of $130.22\times$ on PandasBench.
  • Figure 3: RuleFlow is a Pandas optimization framework with three stages: (i) SnippetGen discovers candidate optimizations from code snippets, (ii) RuleGen converts these optimizations into rewrite rules, and (iii) CodeGen applies the rules to unseen Pandas programs. The framework segregates LLM-based optimization discovery and compiler-based efficient rule deployment.
  • Figure 4: Example application of a rewrite rule by CodeGen.
  • Figure 5: Speedups of different frameworks over Pandas across 102 notebooks. Higher curves indicate better overall performance. For the most part, RuleFlow dominates existing frameworks, establishing a new SOTA Pandas optimization framework.
  • ...and 6 more figures
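
To make the rule in Figure 2 and the rule application in Figure 4 more concrete, here is a minimal sketch of how a drop-to-pop rewrite could be applied by static pattern matching. It is not RuleFlow's implementation: the class `DropToPop` and the helpers `single_column` and `rewrite` are hypothetical names, the sketch uses only Python's standard `ast` module (Python 3.9+ for `ast.unparse`), and it assumes the matched receiver is a Pandas DataFrame, which a purely syntactic matcher cannot verify. It also matches only the statement form `df.drop(columns=..., inplace=True)`, where replacing the call with `pop` keeps the in-place mutation; the exact pattern and guards used in the paper's rule may differ.

```python
import ast


def single_column(node):
    """Return the AST node of the column label if exactly one string column is named."""
    if isinstance(node, ast.List) and len(node.elts) == 1:
        node = node.elts[0]
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node
    return None


class DropToPop(ast.NodeTransformer):
    """Rewrite `x.drop(columns=["c"], inplace=True)` statements into `x.pop("c")`.

    Assumption: `x` is a Pandas DataFrame; this is not checked here.
    """

    def visit_Expr(self, node):
        self.generic_visit(node)
        call = node.value
        # Match a bare statement of the form `<name>.drop(<keywords only>)`.
        if not (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and call.func.attr == "drop"
                and isinstance(call.func.value, ast.Name)
                and not call.args):
            return node
        kwargs = {kw.arg: kw.value for kw in call.keywords}
        # Require exactly `columns=` and `inplace=True`; anything else is left alone.
        if set(kwargs) != {"columns", "inplace"}:
            return node
        inplace = kwargs["inplace"]
        if not (isinstance(inplace, ast.Constant) and inplace.value is True):
            return node
        column = single_column(kwargs["columns"])
        if column is None:
            return node
        new_call = ast.Call(
            func=ast.Attribute(value=call.func.value, attr="pop", ctx=ast.Load()),
            args=[column],
            keywords=[],
        )
        return ast.copy_location(ast.Expr(value=new_call), node)


def rewrite(source: str) -> str:
    """Apply the hypothetical drop-to-pop rule to a source string."""
    tree = DropToPop().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(rewrite('df.drop(columns=["a"], inplace=True)'))
    print(rewrite('df.drop(columns=["a", "b"], inplace=True)'))
```

The sketch deliberately leaves multi-column drops and the assignment form `out = df.drop(columns=["a"])` untouched, since `pop` removes a single column and mutates the frame in place, so rewriting those forms could change program behavior.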