Is Programming by Example solved by LLMs?

Wen-Ding Li, Kevin Ellis

TL;DR

Pretrained models are not effective at PBE, but they can be fine-tuned for much higher performance, provided the test problems are in-distribution.

Abstract

Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have "solved" PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.

Paper Structure (36 sections, 5 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Domains, including standard ones that resemble programs found in pretraining data, as well as a less common graphics domain, which is likely less represented in LLM pretraining data.
  • Figure 2: Left: Data generation pipeline. Right: The fine-tuned network $q_\theta$ learns to do inference in a graphical model where the prior over programs, $\mathcal{G}$, is defined by prompting an LLM with example code in $\mathcal{D}_\text{seed}$, while the likelihood $p(Y|\rho, X)$ is defined by program execution.
  • Figure 3: Test set performance. A problem is solved if the predicted program generates correct outputs on the holdout inputs. Metagol [10.1007/s10994-014-5471-y], RobustFill [robust], and Fleet [fleet] results are taken from [rule2024symbolic].
  • Figure 4: PBE with LLMs allows using general-purpose programming languages, which can mix string and numerical operations in ways not allowed by domain-specific languages [cambronero2023flashfill++] (top), and allows world knowledge to inform code generation (bottom). I/Os and code partly elided for space.
  • Figure 5: ASCII representation of LOGO graphics. Average pixel intensity is indicated by the digits 0-9.
  • ...and 7 more figures
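
The captions of Figures 2 and 3 describe a likelihood defined by program execution and a "solved" criterion based on holdout inputs. A minimal sketch of that check follows, assuming the likelihood acts as an indicator (a candidate either reproduces every given output or it does not). The names `load_candidate`, `consistent`, `first_consistent`, and the entry point `f` are hypothetical and chosen for illustration; they are not taken from the paper. The candidate strings stand in for programs sampled from a model.

```python
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[object, object]  # (input, expected output)


def load_candidate(source: str, entry_point: str = "f") -> Optional[Callable]:
    """Execute candidate source and return its entry-point function.

    Returns None if the source fails to load or defines no such function.
    Note: exec() on untrusted code is unsafe; a real system would sandbox
    this step and enforce a timeout (assumption, not shown here).
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    fn = namespace.get(entry_point)
    return fn if callable(fn) else None


def consistent(source: str, examples: Sequence[Example]) -> bool:
    """True if the candidate reproduces every expected output."""
    fn = load_candidate(source)
    if fn is None:
        return False
    for x, y in examples:
        try:
            if fn(x) != y:
                return False
        except Exception:
            # A runtime error on any input counts as a failure.
            return False
    return True


def first_consistent(candidates: Sequence[str],
                     train: Sequence[Example]) -> Optional[str]:
    """Return the first candidate that matches all training examples."""
    for source in candidates:
        if consistent(source, train):
            return source
    return None


# Illustrative task: double every element of a list.
train = [([1, 2, 3], [2, 4, 6]), ([0], [0])]
holdout = [([5, 5], [10, 10])]
candidates = [
    "def f(xs): return xs + xs",               # concatenates, so it is rejected
    "def f(xs): return [2 * x for x in xs]",   # matches the training examples
]

chosen = first_consistent(candidates, train)
solved = chosen is not None and consistent(chosen, holdout)
```

Here filtering uses only the training examples, and the holdout inputs are consulted once, after a program has been chosen, which mirrors the criterion stated in the Figure 3 caption.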