Goals as Reward-Producing Programs

Guy Davidson; Graham Todd; Julian Togelius; Todd M. Gureckis; Brenden M. Lake

TL;DR

The authors collect a dataset of human-generated playful goals, model them as reward-producing programs, and generate novel human-like goals through program synthesis. Human evaluators found that model-generated goals, when sampled from partitions of program space occupied by human examples, were indistinguishable from human-created games.

Abstract

People are remarkably capable of generating their own goals, beginning with child's play and continuing into adulthood. Despite considerable empirical and computational work on goals and goal-oriented behavior, models are still far from capturing the richness of everyday human goals. Here, we bridge this gap by collecting a dataset of human-generated playful goals (in the form of scorable, single-player games), modeling them as reward-producing programs, and generating novel human-like goals through program synthesis. Reward-producing programs capture the rich semantics of goals through symbolic operations that compose, add temporal constraints, and allow for program execution on behavioral traces to evaluate progress. To build a generative model of goals, we learn a fitness function over the infinite set of possible goal programs and sample novel goals with a quality-diversity algorithm. Human evaluators found that model-generated goals, when sampled from partitions of program space occupied by human examples, were indistinguishable from human-created games. We also discovered that our model's internal fitness scores predict games that are evaluated as more fun to play and more human-like.
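To make the idea of a reward-producing program concrete, the following is a minimal Python sketch, not the paper's domain-specific language or its interpreter. It treats a goal as an ordered list of predicates that is executed over a behavioral trace and emits a score. The `State` fields, the `count_completions` helper, the `THROW_TO_BIN` stages, and the 5 points per throw are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class State:
    """One time step of a toy trace (hypothetical fields, not the paper's state format)."""
    agent_holds_ball: bool
    ball_in_motion: bool
    ball_in_bin: bool


Predicate = Callable[[State], bool]


def count_completions(trace: Sequence[State], stages: Sequence[Predicate]) -> int:
    """Count how often the stages are satisfied in order along the trace.

    Each stage must hold for one or more consecutive states before the next
    stage begins; a state matching neither the current nor the next stage
    resets progress.
    """
    completions = 0
    stage = -1  # index of the stage currently satisfied; -1 means "not started"
    for state in trace:
        nxt = stage + 1
        if stages[nxt](state):
            stage = nxt
            if stage == len(stages) - 1:  # final stage reached: one completion
                completions += 1
                stage = -1
        elif stage >= 0 and stages[stage](state):
            continue  # the current stage keeps holding
        else:
            stage = 0 if stages[0](state) else -1
    return completions


# Assumed toy goal: "throw the ball into the bin", as three temporal stages.
THROW_TO_BIN: Sequence[Predicate] = (
    lambda s: s.agent_holds_ball,                           # the agent holds the ball
    lambda s: s.ball_in_motion and not s.agent_holds_ball,  # the ball is in flight
    lambda s: s.ball_in_bin and not s.ball_in_motion,       # the ball rests in the bin
)


def throw_to_bin_score(trace: Sequence[State], points_per_throw: int = 5) -> int:
    """Reward-producing program: run the goal on a trace and emit a score."""
    return points_per_throw * count_completions(trace, THROW_TO_BIN)


trace = [
    State(True, False, False),   # agent picks up the ball
    State(False, True, False),   # ball in flight
    State(False, True, False),   # still in flight
    State(False, False, True),   # ball lands in the bin: one completion
    State(True, False, False),   # agent picks the ball up again
    State(False, True, False),   # second throw
    State(False, False, False),  # ball stops outside the bin: no completion
]
score = throw_to_bin_score(trace)
```

Because the goal is an ordinary composition of predicates, swapping one stage (for example, a different landing target) yields a related goal without changing the evaluator, which is the kind of compositional reuse the abstract describes.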

Paper Structure (34 sections, 2 equations, 18 figures, 10 tables, 1 algorithm)

Pseudocode and program summary translation
Natural language to domain-specific language translation analyses
Full feature set
Features Most Predictive of Real or Regrown Games
Objective function algorithm descriptions
MAP-Elites algorithm details
DSL to natural language back-translation
Model sample and real game edit distance similarity
Highest fitness games
Human evaluations data analysis
Detailed human evaluation results
Fitness-less mixed models analysis
Fitness-inclusive mixed-effect model analyses
Matched-real game similarity analysis
Random effects analysis
...and 19 more sections

Figures (18)

Figure 1: Goals as Reward-Producing Programs. Panels a-d show different goals, presented in natural language and mapped to pseudo-code in a program-like representation. Panel e shows a set of varied yet related goals in our experiment environment, of which the blue and pink were created by participants in our experiment. Each goal is represented by a throw trajectory (dashed line in the illustration) matching a description of the goal (whose text is the same color as the line). We highlight shared compositional components between programs in yellow, orange, and green. Our program representations are reward-producing, that is, run on sequences of agent interactions with an environment (state-action pairs) and emit a score with respect to the specified goal. Our pseudo-code and domain-specific language both use a LISP-like syntax, where function calls have the function name as the first token inside the parentheses. Participants in our experiment created some of these goals; see \ref{fig:appendix-pseudocode-translation} for representations of the blue and pink programs in our domain-specific language.
Figure 2: Participants in our behavioral experiment create diverse games reflecting common sense and compositionality. (a): Our online game creation experiment (see full interface in \ref{fig:experiment-interface}). (b): Participants showcase intuitive common sense. Left: In games involving exclusively throwing, participants use balls (orange) far more often than any other object type. Right: In other games, participants refer to blocks or "any object" more often, most often checking where objects are placed (using the and predicates). We most often observe balls being thrown and blocks being stacked, and while a few participants specified block-throwing games, no participant created a game involving ball-stacking. Participants also rarely specified throwing large or cumbersome objects (such as the chair or laptop), and only used buildings to specify stacking objectives (as opposed to moving or throwing them). See \ref{fig:behavioral-common-sense} for an extended version of this panel (including additional object categories and predicates). (c): We analyze the occurrence of various abstract structures in our programs (see \ref{methods:dataset-analyses} for details). Red: The five most common structures cover almost half (47.5%) of total occurrences, showing extensive compositional reuse. The three most common structures combine into a simple ball-to-bin throwing preference ((1), structure indices in square brackets). Purple: Other structures are reused fewer times, covering most remaining occurrences (another 40.5%). These rarer structures allow for creating more complex throwing elements, constraining where the player throws the ball from (2,3) or to (3). Blue: Exactly half of the structures (63 / 126) appear only once; this long tail of expressions offers evidence of creativity. The last throwing preference (4), specifying throwing a block from the rug onto the desk without moving off the rug or breaking any of the objects on the desk, uses two unique structures.
Figure 3: Goal Program Generator model. (a) Overview: Our model operates on programs in some high-dimensional space (visualized in two dimensions). We learn a fitness metric (Z-axis) capturing desirable aspects of programs using a dataset of human-created goals (highlighted in green). Our model then generates diverse new samples maximizing the fitness measure, some "matched" to participant-created goal programs on diversity criteria (in blue) and others "unmatched" novel goals (in purple). These programs stand in contrast to potential failure modes, such as generating programs that are malformed or semantically incoherent (in red). All (non-red) goals in this figure were created by participants in our experiment or our model; see \ref{fig:appendix-pseudocode-translation} for their full representations in our domain-specific language. (b) Parameter learning: We contrastively learn a quantitative measure of fitness (the Z-axis in (a)) by maximizing the distance between human-generated exemplar games and a set of corruptions obtained through random tree regrowth. (c) Search: This measure is then used as the basis for quality-diversity optimization using MAP-Elites. The algorithm maintains an archive of games that differ across phenotypic "behavioral characteristics." At each step, a game is randomly sampled from the archive (1), randomly mutated (2), and re-evaluated for fitness and its position in the archive. It is added to the archive only if it would occupy a previously empty position or if it is more fit than the current occupant (3).
Figure 4: Goal Program Generator model produces simple, coherent, human-like games. Each pair of games in a column has the same set of MAP-Elites behavioral characteristics (a real participant-created game and the corresponding "matched" model-generated one). Parentheses: the fitness score assigned by the model to each game. Natural language descriptions are generated through automated back-translation from programs (see \ref{sec:appendix-backtranslation} for details). To confirm that the model-generated programs are distinct from training set examples, we also provide the most similar real exemplar under an edit distance in \ref{fig:appendix-edit-distance}; see \ref{sec:appendix-model-sample-edit-distance} for details.
Figure 5: Goal Program Generator model produces interesting, novel goals. Each of the three games below has high fitness and fills an "unmatched" cell in the MAP-Elites archive, with no corresponding human game in our dataset. Parentheses: the fitness score assigned by the model to each game.
...and 13 more figures
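
The search loop described in the Figure 3(c) caption (sample from the archive, mutate, re-evaluate, and insert only into an empty cell or over a less fit occupant) can be sketched generically in Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the genomes here are tuples of integers rather than goal programs, and `toy_fitness`, `toy_descriptor`, and `toy_mutate` are hypothetical stand-ins for the learned fitness function, the behavioral characteristics, and the program mutation operators.

```python
import random
from typing import Callable, Dict, Hashable, Iterable, Tuple, TypeVar

G = TypeVar("G")  # genome type: a goal program in the paper, a tuple of ints here


def map_elites(
    seeds: Iterable[G],
    mutate: Callable[[G, random.Random], G],
    fitness: Callable[[G], float],
    descriptor: Callable[[G], Hashable],
    steps: int,
    rng: random.Random,
) -> Dict[Hashable, Tuple[float, G]]:
    """Return an archive mapping each occupied cell to its (fitness, genome) elite."""
    archive: Dict[Hashable, Tuple[float, G]] = {}

    def try_insert(candidate: G) -> None:
        cell, score = descriptor(candidate), fitness(candidate)
        # (3) keep the candidate only if its cell is empty or it beats the occupant
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    for seed in seeds:
        try_insert(seed)
    if not archive:
        raise ValueError("map_elites needs at least one seed genome")

    for _ in range(steps):
        _, parent = rng.choice(list(archive.values()))  # (1) sample from the archive
        try_insert(mutate(parent, rng))                 # (2) mutate, then re-evaluate
    return archive


# Hypothetical toy problem standing in for goal programs.
def toy_fitness(genome: Tuple[int, ...]) -> float:
    return -abs(sum(genome) - 20)  # best when the entries sum to 20


def toy_descriptor(genome: Tuple[int, ...]) -> int:
    return sum(1 for x in genome if x % 2 == 0)  # cell = number of even entries


def toy_mutate(genome: Tuple[int, ...], rng: random.Random) -> Tuple[int, ...]:
    i = rng.randrange(len(genome))
    return genome[:i] + (rng.randrange(10),) + genome[i + 1:]


seed_genome = (0, 0, 0, 0)
archive = map_elites([seed_genome], toy_mutate, toy_fitness, toy_descriptor,
                     steps=2000, rng=random.Random(0))
```

The archive keeps one elite per cell, so the result is a set of high-fitness candidates that differ along the descriptor rather than a single optimum. In the paper's terms, cells that also contain a human-created game would correspond to "matched" samples and the remaining cells to "unmatched" ones.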
