
Reinforcement learning for freeform robot design

Muhan Li, David Matthews, Sam Kriegman

TL;DR

Policy gradients are used to design freeform robots with arbitrary external and internal structure, via actions that deposit or remove bundles of atomic building blocks to form higher-level nonparametric macrostructures such as appendages, organs and cavities.

Abstract

Inspired by the necessity of morphological adaptation in animals, a growing body of work has attempted to expand robot training to encompass physical aspects of a robot's design. However, reinforcement learning methods capable of optimizing the 3D morphology of a robot have been restricted to reorienting or resizing the limbs of a predetermined and static topological genus. Here we show policy gradients for designing freeform robots with arbitrary external and internal structure. This is achieved through actions that deposit or remove bundles of atomic building blocks to form higher-level nonparametric macrostructures such as appendages, organs and cavities. Although results are provided for open loop control only, we discuss how this method could be adapted for closed loop control and sim2real transfer to physical machines in future.
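
As a rough illustration of what a design action that deposits or removes a bundle of building blocks might look like, the sketch below writes a spherical bundle of voxels into a sparse grid. Everything here is an assumption made for illustration: the material codes, the spherical bundle shape, the `apply_design_action` helper and the `grid_size` workspace bound are hypothetical and are not taken from the paper, which does not specify its action parameterization in this section.

```python
from itertools import product

# Hypothetical material codes (assumed for this sketch, not from the paper).
EMPTY, PASSIVE, MUSCLE_A, MUSCLE_B = 0, 1, 2, 3


def apply_design_action(body, center, radius, material, grid_size):
    """Write `material` into every voxel within `radius` of `center`.

    `body` maps (x, y, z) -> material code; absent keys are empty space.
    Writing EMPTY removes voxels; any other code deposits or overwrites.
    """
    cx, cy, cz = center
    r = int(radius)
    for dx, dy, dz in product(range(-r, r + 1), repeat=3):
        if dx * dx + dy * dy + dz * dz > radius * radius:
            continue  # outside the spherical bundle
        x, y, z = cx + dx, cy + dy, cz + dz
        if not (0 <= x < grid_size and 0 <= y < grid_size and 0 <= z < grid_size):
            continue  # clip the bundle to the design workspace
        if material == EMPTY:
            body.pop((x, y, z), None)
        else:
            body[(x, y, z)] = material
    return body


body = {}
apply_design_action(body, (4, 4, 4), 2, PASSIVE, grid_size=9)   # deposit a blob of tissue
apply_design_action(body, (4, 4, 4), 1, EMPTY, grid_size=9)     # carve an internal cavity
apply_design_action(body, (4, 4, 7), 1, MUSCLE_A, grid_size=9)  # add a muscle bundle on top
```

Because each action overwrites whatever is already in its footprint, a short sequence of such actions can build external protrusions and internal voids without committing to a fixed body plan in advance.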

Paper Structure (9 sections, 2 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Freeform robot design. A policy (top row) was trained to design increasingly motile robots, and predict their locomotive ability (critic; $V(s)$), using a sequence of 100 design actions that freely position, overwrite or remove bundles of muscles (green and red) and passive tissue (dark blue) to form a robot (middle) whose behavior in simulation (bottom) determines the policy's reward (https://youtu.be/ybaEVDGvkTE).
  • Figure 2: Learning to design large nonparametric bodies. Left: Mean reward (volume; dark gray curve) and its $99\%$ Normal confidence interval (light gray bands) across 5 independent learning trials. Right: Six bodies (A-F) sampled along the reward curve from least to most voluminous.
  • Figure 3: Designs for locomotion sampled across 5 independent trials at different epochs across training. Reward is net displacement (in voxel lengths) measured from evaluation start to end.
  • Figure 4: Learning to design freeform robots. Mean (dark gray curve) and $99\%$ Normal confidence intervals (light gray bands) of reward (net displacement in voxel lengths; A), body volume (number of voxels; B), surface voxels to volume ratio (C), passive material ratio (D), largest connected component ratio (E), number of substructures (separate material regions; F), reflection symmetry (G), and compressibility (using gzip; H) during policy optimization across 5 independent learning trials. The policy learned to produce larger, more symmetrical bodies with less passive tissue and higher complexity, as measured by the number of substructures and compression score.
  • Figure 5: Critic estimation of behavioral reward. Predicted behavioral reward of the untrained (red) and trained (blue) critic against ground truth (gray) for designs generated at each epoch during training. Colored bands denote $99\%$ Normal confidence intervals across the 5 independent trials. The trained critic has learned the concept of locomotion as demonstrated by a significant improvement in prediction ability over the untrained critics. In-domain predictions are computed from bodies that each critic has seen during training (A), and out-of-domain predictions are computed from bodies taken from the sibling trials that each critic was not trained under (B). The trained critics generalize well to out-of-domain bodies, providing evidence that their understanding of the concept of locomotion extends beyond the specific bodies they saw during training.
  • ...and 1 more figures
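
Several of the morphological statistics named in the Figure 4 caption, along with the Normal confidence intervals used throughout the figures, can be sketched on a sparse voxel body as follows. This is only a plausible reading of the caption: the paper's exact definitions are not given in this section, so the face-connectivity rule, the choice of the x midplane for reflection, the byte encoding passed to gzip and all function names below are assumptions.

```python
import gzip
from statistics import NormalDist, mean, stdev

# Face-adjacent neighbor offsets (6-connectivity is an assumption of this sketch).
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def surface_to_volume(body):
    """Fraction of voxels with at least one empty face-neighbor."""
    surface = sum(
        any((x + dx, y + dy, z + dz) not in body for dx, dy, dz in NEIGHBORS)
        for x, y, z in body
    )
    return surface / len(body)


def largest_component_ratio(body):
    """Size of the largest face-connected component divided by total volume."""
    unseen, best = set(body), 0
    while unseen:
        stack, size = [unseen.pop()], 0
        while stack:  # depth-first flood fill of one component
            x, y, z = stack.pop()
            size += 1
            for dx, dy, dz in NEIGHBORS:
                n = (x + dx, y + dy, z + dz)
                if n in unseen:
                    unseen.remove(n)
                    stack.append(n)
        best = max(best, size)
    return best / len(body)


def reflection_symmetry(body, grid_size):
    """Fraction of voxels whose mirror image across the x midplane has the same material."""
    same = sum(body.get((grid_size - 1 - x, y, z)) == m for (x, y, z), m in body.items())
    return same / len(body)


def gzip_ratio(body, grid_size):
    """Compressed size over raw size of the flattened grid (lower means more compressible)."""
    cells = range(grid_size)
    raw = bytes(body.get((x, y, z), 0) for x in cells for y in cells for z in cells)
    return len(gzip.compress(raw)) / len(raw)


def normal_ci(values, level=0.99):
    """Mean and half-width of a Normal confidence interval across independent trials."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean(values), z * stdev(values) / len(values) ** 0.5


# A solid 3x3x3 cube of material 1 plus one detached voxel of material 2.
body = {(x, y, z): 1 for x in (1, 2, 3) for y in (1, 2, 3) for z in (1, 2, 3)}
body[(0, 0, 0)] = 2
```

Here `body` maps `(x, y, z)` coordinates to integer material codes, with absent keys treated as empty space. Calling `normal_ci` on the per-trial values of any one statistic at a given epoch gives the center and half-width of a band like the ones described in the captions.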