PHYRE: A New Benchmark for Physical Reasoning

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, Ross Girshick

TL;DR

PHYRE presents a two-tier, 2D deterministic physics benchmark to evaluate agents' physical reasoning under strict sample-efficiency and generalization demands. It defines tasks as goals achieved by placing dynamic bodies in a Newtonian world and measures performance with the AUCCESS metric, emphasizing few-attempt success. The study compares baselines including RAND, MEM, and DQN variants, finding that online learning offers advantages but current methods still struggle with cross-template generalization and fast solution discovery, underscoring the need for counterfactual reasoning and forward models. The benchmark is designed to be extensible, promoting development of compact, transferable physical models and robust generalization across diverse puzzles.

Abstract

Understanding and reasoning about physics is an important ability of intelligent agents. We develop the PHYRE benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles. We test several modern learning algorithms on PHYRE and find that these algorithms fall short in solving the puzzles efficiently. We expect that PHYRE will encourage the development of novel sample-efficient agents that learn efficient but useful models of physics. For code and to play PHYRE for yourself, please visit https://player.phyre.ai.

Paper Structure

This paper contains 15 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Three examples of PHYRE tasks (left) and one example solution (right). Black objects are static; objects with any other color are dynamic and subject to gravity. The tasks describe a terminal goal state that can be achieved by placing additional object(s) in the world and running the simulator. The task in the left-most pane requires placement of two balls to be solved, whereas the others can be solved with one ball. The right-most pane illustrates a solution (red ball) and the solution dynamics.
  • Figure 2: PHYRE complexity analysis. Values are averaged over 10 runs over all tasks in the tier; error bars indicate one standard deviation. Two-ball tasks are much harder to solve by chance than single-ball tasks. Each tier contains a spectrum of task difficulty with respect to random guessing.
  • Figure 3: Percentage of solved tasks (success percentage) as a function of the number of attempts per task of five agents on PHYRE-{B, 2B} in the within-template and cross-template settings. Success percentages are averaged over all test tasks and 10 folds. Shaded regions show one standard deviation.
  • Figure 4: AUCCESS as a function of the number of actions being ranked by the agent for the RANDOM, MEM, and DQN agents and for an agent that is OPTIMAL in terms of scoring attempts.
  • Figure 5: AUCCESS of MEM-O and DQN-O agents as the "aggressiveness" of the online update is varied during the testing phase. The left-most point in each plot is an offline version of the agent.
  • ...and 6 more figures
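
The success-percentage curves in Figure 3 and the AUCCESS values in Figures 4 and 5 are two views of the same data: AUCCESS collapses a success-versus-attempts curve into a single number that rewards solving tasks in few attempts. The sketch below shows one way to compute it. The weighting w_k = log(k + 1) - log(k) over up to 100 attempts is not stated in this excerpt and is assumed here from the metric's usual definition; the function name `auccess` and the example curve are hypothetical and are not part of any PHYRE API.

```python
import math


def auccess(success_at_k):
    """Collapse a success curve into one score in [0, 1].

    success_at_k[k - 1] is the fraction of tasks solved within k attempts,
    for k = 1..K (K = 100 is assumed to be the usual attempt budget).
    """
    if not success_at_k:
        raise ValueError("success_at_k must contain at least one value")
    # Assumed weighting: w_k = log(k + 1) - log(k), so early attempts count most.
    weights = [math.log(k + 1) - math.log(k)
               for k in range(1, len(success_at_k) + 1)]
    weighted_sum = sum(w * s for w, s in zip(weights, success_at_k))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Made-up curve: 21% of tasks solved on the first attempt, rising
    # linearly until every task is solved by attempt 80.
    curve = [min(1.0, 0.2 + 0.01 * k) for k in range(1, 101)]
    print(f"AUCCESS = {auccess(curve):.3f}")
```

Because the weights shrink as k grows, an agent that solves every task only on its final attempt scores close to zero, while one that solves every task on the first attempt scores 1.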