
ReGAL: Refactoring Programs to Discover Generalizable Abstractions

Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal

TL;DR

ReGAL introduces a gradient-free framework that discovers and verifies reusable abstractions by refactoring small sets of primitive programs into a library of helper functions. The method iteratively refactors, tests, and prunes abstractions, enabling cross-task generalization and improving LLM-based program prediction across five diverse domains. Empirical results show notable accuracy gains for open-source LLMs and robust performance under distribution shifts, and analyses show that the learned abstractions encode reusable subroutines and environment dynamics. The approach shifts part of the prediction burden from regenerating primitive code to reusing a shared, verified library, improving generalization and efficiency.

Abstract

While large language models (LLMs) are increasingly being used for program synthesis, they lack the global view needed to develop useful abstractions; they generally predict programs one at a time, often repeating the same functionality. Generating redundant code from scratch is both inefficient and error-prone. To address this, we propose Refactoring for Generalizable Abstraction Learning (ReGAL), a gradient-free method for learning a library of reusable functions via code refactorization, i.e., restructuring code without changing its execution output. ReGAL learns from a small set of existing programs, iteratively verifying and refining its abstractions via execution. We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains. On five datasets -- LOGO graphics generation, Date reasoning, TextCraft (a Minecraft-based text game), MATH, and TabMWP -- both open-source and proprietary LLMs improve in accuracy when predicting programs with ReGAL functions. For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains. Our analysis reveals that ReGAL's abstractions encapsulate frequently used subroutines as well as environment dynamics.
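The abstract defines refactoring as restructuring code without changing its execution output, and says abstractions are verified via execution. A minimal sketch of that check follows. It assumes a toy LOGO-like setting in which the hypothetical primitives `forward` and `left` record moves into a `trace` list instead of drawing; the helper name `draw_small_5gon` is borrowed from the Figure 5 caption, but its body here is invented for illustration. This is not the authors' implementation.

```python
from typing import Any

# Hypothetical LOGO-like primitives: they record moves rather than drawing,
# so two programs can be compared by their execution traces.
PRIMITIVES = """
trace = []

def forward(dist):
    trace.append(("forward", dist))

def left(angle):
    trace.append(("left", angle))
"""


def run_program(program: str, helpers: str = "") -> Any:
    """Load primitives, then helper definitions, then run the program; return its trace."""
    namespace: dict[str, Any] = {}
    exec(PRIMITIVES, namespace)
    exec(helpers, namespace)  # helpers can call the primitives defined above
    exec(program, namespace)
    return namespace["trace"]


def verify_refactoring(original: str, refactored: str, helpers: str) -> bool:
    """Accept a refactoring only if it reproduces the original execution output."""
    try:
        return run_program(refactored, helpers) == run_program(original)
    except Exception:
        return False  # a refactoring that fails to run counts as unverified


# A primitive-only program that draws a pentagon.
original = """
for _ in range(5):
    forward(2)
    left(72)
"""

# Candidate helper proposed by refactoring (hypothetical body).
helpers = """
def draw_small_5gon():
    for _ in range(5):
        forward(2)
        left(72)
"""

bad_helpers = helpers.replace("forward(2)", "forward(3)")

print(verify_refactoring(original, "draw_small_5gon()", helpers))      # True
print(verify_refactoring(original, "draw_small_5gon()", bad_helpers))  # False
```

Comparing execution outputs rather than source text lets the check accept any restructuring, however different it looks, as long as the behavior is preserved. Helpers that only pass through a failed or crashing refactoring would not be admitted to the library.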
Paper Structure (35 sections, 10 figures, 15 tables, 2 algorithms)


Figures (10)

  • Figure 1: ReGAL trains by refactoring primitive-only programs into abstractions that are verified and stored. These abstractions have two benefits. Reusability: rewriting the same code multiple times leads to errors. Abstraction: ReGAL makes prediction easier by allowing the query to be matched against the abstractions.
  • Figure 2: ReGAL starts by refactoring a batch of primitive programs to develop a set of modified programs and helper functions (Stage 1). It then verifies the results of refactored programs, optionally retrying failed programs according to environment feedback. Useful helper functions are added to the Code Bank along with example usage added to the Demo Bank (Stage 2). Periodically, we edit and prune the Code Bank to improve its functions (Stage 3). At test time, the ReGAL agent has access to the Code Bank, the Demo Bank, and the remaining original programs. It is compared against a baseline agent which has access to a larger number of original programs.
  • Figure 3: Function usage by CodeLlama-13B for the five most common helpers, illustrating reusability across examples. The x-axis denotes the number of times a function is used in the test set.
  • Figure 4: ReGAL programs yield a higher success rate (accuracy) compared to primitive programs on TextCraft for different sizes of training set $X$ using CodeLlama-13B.
  • Figure 5: Examples of discovered programs for LOGO, as mentioned in Figures 1 and 3. As the names suggest, draw_small_5gon() draws a small pentagon and draw_semicircle() draws a small semicircle.
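Figure 2 describes a Code Bank of verified helpers, a Demo Bank of example usages, and a periodic step that edits and prunes the Code Bank. The sketch below shows one way such bookkeeping could look. The pruning rule (drop helpers whose programs pass less often than a threshold) is an assumption made for illustration, not the criterion stated in this summary, and the editing half of Stage 3 is omitted. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HelperRecord:
    """A helper function's source plus outcomes of programs that used it."""
    source: str
    passes: int = 0
    failures: int = 0

    def pass_rate(self) -> float:
        total = self.passes + self.failures
        # Assumption: a helper never used so far gets rate 0.0 and is pruned.
        return self.passes / total if total else 0.0


@dataclass
class CodeBank:
    helpers: dict[str, HelperRecord] = field(default_factory=dict)
    # Demo Bank: helper name -> example programs that use it.
    demos: dict[str, list[str]] = field(default_factory=dict)

    def add(self, name: str, source: str, demo: str) -> None:
        """Store a verified helper (keeping the first source seen) and a usage demo."""
        self.helpers.setdefault(name, HelperRecord(source))
        self.demos.setdefault(name, []).append(demo)

    def record_result(self, used: list[str], passed: bool) -> None:
        """Credit or blame every known helper used by a program that passed or failed."""
        for name in used:
            record = self.helpers.get(name)
            if record is None:
                continue
            if passed:
                record.passes += 1
            else:
                record.failures += 1

    def prune(self, min_pass_rate: float = 0.5) -> list[str]:
        """Remove helpers below the (hypothetical) pass-rate threshold; return their names."""
        removed = [n for n, r in self.helpers.items() if r.pass_rate() < min_pass_rate]
        for name in removed:
            del self.helpers[name]
            self.demos.pop(name, None)
        return removed


bank = CodeBank()
bank.add("draw_small_5gon", "def draw_small_5gon(): ...", "draw_small_5gon()")
bank.add("draw_semicircle", "def draw_semicircle(): ...", "draw_semicircle()")
bank.record_result(["draw_small_5gon"], passed=True)
bank.record_result(["draw_small_5gon", "draw_semicircle"], passed=True)
bank.record_result(["draw_semicircle"], passed=False)
bank.record_result(["draw_semicircle"], passed=False)
removed = bank.prune(min_pass_rate=0.5)
print(removed)               # ['draw_semicircle']
print(sorted(bank.helpers))  # ['draw_small_5gon']
```

Tracking outcomes per helper ties pruning to observed usefulness, in the spirit of the usage counts in Figure 3. A real system might instead require a minimum number of uses before judging a helper, so that new functions are not discarded too early.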