Programming Puzzles

Tal Schuster; Ashwin Kalyan; Oleksandr Polozov; Adam Tauman Kalai

TL;DR

Python Programming Puzzles (P3) introduces a formal, objective evaluation framework for program synthesis using verifiers $f$ and inputs $x$, and releases an open-source dataset of 397 Python puzzles spanning diverse domains from simple syntax to longstanding open problems. The paper develops enumerative AST-based solvers and autoregressive LM solvers (GPT-3 and Codex), showing bootstrapping from past solutions improves performance, with Codex solving up to about 80% of problems given enough tries. A small user study indicates puzzle solving correlates with coding experience, and that human and AI difficulty align, validating puzzles as a benchmark for algorithmic problem-solving progress. The dataset and baselines provide a platform to push advances in program synthesis and could influence code completion, automatic debugging, and the exploration of hard algorithmic problems.

Abstract

We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input which makes $f$ return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f$ is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems, to classic programming puzzles (e.g., Tower of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). We develop baseline enumerative program synthesis, GPT-3 and Codex solvers that are capable of solving puzzles -- even without access to any reference solutions -- by learning from their own past solutions. Codex performs best, solving up to 18% of 397 test problems with a single try and 80% of the problems with 1,000 tries per problem. In a small user study, we find a positive correlation between puzzle-solving performance and coding experience, and between the puzzle difficulty for humans and AI solvers. Therefore, further improvements on P3 could have a significant impact on many program synthesis areas.
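To make the puzzle format concrete, here is a minimal sketch of a verifier `f` together with a naive brute-force solver. The specific puzzle, the helper names `candidates` and `solve`, and the three-letter alphabet are illustrative assumptions, not items from the P3 dataset or the paper's baseline solvers. The sketch only demonstrates the property the abstract describes: a candidate is checked purely by calling `f`, with no answer key or input/output examples.

```python
from itertools import product
from typing import Callable, Iterable, Iterator, Optional


def f(s: str) -> bool:
    """Hypothetical verifier: True iff s is a 3-character palindrome with exactly two 'a's."""
    return len(s) == 3 and s.count("a") == 2 and s == s[::-1]


def candidates(alphabet: str = "abc", max_len: int = 3) -> Iterator[str]:
    """Yield all strings over `alphabet`, shortest first (an assumed search space)."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def solve(verifier: Callable[[str], bool], pool: Iterable[str]) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if none is accepted."""
    for x in pool:
        if verifier(x):
            return x
    return None


if __name__ == "__main__":
    answer = solve(f, candidates())
    print(answer, f(answer) if answer is not None else None)
```

Exhaustive search like this only scales to tiny input spaces; the enumerative and language-model solvers described in the abstract instead search over programs that produce a solution.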

