Automatic Environment Shaping is the Next Frontier in RL

Younghyo Park; Gabriel B. Margolis; Pulkit Agrawal

TL;DR

The paper addresses the bottleneck of manual environment shaping in robotics RL and argues for automatic environment shaping as a path to generalization. It formalizes environment shaping as a bilevel optimization: an outer loop searches over shaping functions f that transform a reference environment E^ref into a learnable shaped environment E^shaped = f(E^ref), and an inner RL optimization finds a policy π* that maximizes reward on E^shaped. The outer objective is evaluated on a test environment E^test: max_{f∈F} J(π*; E^test) subject to π* ∈ argmax_π J(π; E^shaped). The paper details a four-subtask workflow (modeling sample environments, shaping reference environments, RL training, and behavior evaluation and reflection), analyzes the current state of practice to show that automation focused only on rewards is insufficient, and proposes paths forward including scalable outer-loop search, better priors, online shaping, and unshaped robotics benchmarks. It also advocates concrete tooling for shaping experiments and benchmarks that measure the total cost of applying RL to real-world tasks, with the aim of reducing human effort and improving robustness across tasks and domains.
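The bilevel structure in the TL;DR can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional `Env`, the return function `J`, the four candidate shaping functions, and the names `train_policy` and `shape_environment` are all hypothetical. The inner loop uses finite-difference gradient ascent as a stand-in for RL training, and the outer loop is a plain exhaustive search over a tiny shaping space.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Env:
    """Toy 1-D environment (hypothetical): return peaks when the policy equals `target`."""
    target: float


def J(policy: float, env: Env) -> float:
    """Return of a scalar 'policy' on `env`; a stand-in for expected return."""
    return -(policy - env.target) ** 2


def train_policy(env: Env, steps: int = 200, lr: float = 0.1, eps: float = 1e-3) -> float:
    """Inner loop: finite-difference gradient ascent, standing in for RL training."""
    policy = 0.0
    for _ in range(steps):
        grad = (J(policy + eps, env) - J(policy - eps, env)) / (2 * eps)
        policy += lr * grad
    return policy


ShapingFn = Callable[[Env], Env]

# Hypothetical shaping space F: each candidate shifts the reference target.
CANDIDATES: Dict[str, ShapingFn] = {
    f"shift_{d}": (lambda e, d=d: replace(e, target=e.target + d))
    for d in (0.0, 0.5, 1.0, 1.5)
}


def shape_environment(
    e_ref: Env, e_test: Env, candidates: Dict[str, ShapingFn]
) -> Tuple[str, float]:
    """Outer loop: pick the shaping function whose trained policy scores best on e_test."""
    best_name, best_score = "", float("-inf")
    for name, f in candidates.items():
        e_shaped = f(e_ref)               # E^shaped = f(E^ref)
        pi_star = train_policy(e_shaped)  # pi* approximately maximizes J(pi; E^shaped)
        score = J(pi_star, e_test)        # outer objective J(pi*; E^test)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


if __name__ == "__main__":
    print(shape_environment(Env(target=2.0), Env(target=3.0), CANDIDATES))
```

Note that every outer-loop candidate costs one full inner training run, which is why the scalability of the outer search matters once the shaping space is larger than a handful of options.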

Abstract

Many roboticists dream of presenting a robot with a task in the evening and returning the next morning to find the robot capable of solving the task. What is preventing us from achieving this? Sim-to-real reinforcement learning (RL) has achieved impressive performance on challenging robotics tasks, but requires substantial human effort to set up the task in a way that is amenable to RL. It's our position that algorithmic improvements in policy optimization and other ideas should be guided towards resolving the primary bottleneck of shaping the training environment, i.e., designing observations, actions, rewards and simulation dynamics. Most practitioners tune not the RL algorithm but the environment parameters to obtain a desirable controller. We posit that scaling RL to diverse robotic tasks will only be achieved if the community focuses on automating environment shaping procedures.

Paper Structure (12 sections, 8 equations, 5 figures, 2 tables)

Introduction
Robotic Behavior Generation with RL
Modeling Sample Environments
Shaping Reference Environments
RL Training
Optimizing Environment Shaping via Iterative Behavior Evaluation and Reflection
The Current State of Environment Shaping
RL Benchmarks for Robotics are Artificially Easy
Shaping the Entire Environment is Harder than Shaping One Component
Existing Automation Focuses Narrowly on Rewards
Paths Forward to Automated Environment Shaping
Conclusion

Figures (5)

Figure 1: Flowchart of a typical behavior generation pipeline using reinforcement learning with simulation, illustrating four distinct subtasks of sample environment modeling, environment shaping, RL training, and outer feedback loop with behavior evaluation and reflection. We highlight the manual, task-driven environment shaping as a key, yet often overlooked, bottleneck in generalizing the success of RL. We thus advocate for automating the environment shaping process to broaden RL's applicability.
Figure 2: Example of environment complexity: an overloaded and disorganized real-world dishwasher.
Figure 3: Action space shaping: (Top) Original shaped action space with task-specific features. (Bottom) Unshaped action space consisting of joint torque commands. Some shaped code has been slightly modified from the source to increase brevity and clarity while preserving the original logic.
Figure 4: State space shaping: (Top) Original shaped state space with task-specific features. (Bottom) Unshaped state space contains the entire raw simulator state.
Figure 5: Local optima in environment shaping problems. Each node represents a shaped training environment. Edges connect environments that are separated by modifying one type of shaping (action space, state space, reward function, initial state, goal, or terminal condition). Bold arrows represent optimal choices for hill climbing. Each environment is shown to have multiple local optima corresponding to the top row of nodes.
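The local-optima effect described in the Figure 5 caption can be reproduced with a small sketch. The score table below is invented purely for illustration and does not come from the paper; it uses three binary shaping choices instead of the six shaping types named in the caption, and `neighbors` and `hill_climb` are hypothetical helper names. In practice each score would require a full RL training run and evaluation.

```python
from typing import Dict, List, Tuple

# (action shaping, state shaping, reward shaping); 0 = unshaped, 1 = shaped.
Config = Tuple[int, int, int]

# Hypothetical test-environment scores for each shaped training environment.
SCORES: Dict[Config, float] = {
    (0, 0, 0): 0.1, (1, 0, 0): 0.3, (0, 1, 0): 0.2, (0, 0, 1): 0.5,
    (1, 1, 0): 0.6, (1, 0, 1): 0.2, (0, 1, 1): 0.4, (1, 1, 1): 0.9,
}


def neighbors(cfg: Config) -> List[Config]:
    """Environments reachable by modifying exactly one type of shaping."""
    out = []
    for i in range(len(cfg)):
        flipped = list(cfg)
        flipped[i] = 1 - flipped[i]
        out.append(tuple(flipped))
    return out


def hill_climb(start: Config) -> Config:
    """Steepest-ascent hill climbing: move to the best neighbor until none improves."""
    current = start
    while True:
        best = max(neighbors(current), key=SCORES.__getitem__)
        if SCORES[best] <= SCORES[current]:
            return current
        current = best


if __name__ == "__main__":
    print(hill_climb((0, 0, 0)))  # stops at a local optimum
    print(hill_climb((1, 0, 0)))  # reaches the global optimum
```

Starting from the fully unshaped environment, the greedy search commits to reward shaping first and then cannot improve by changing any single component, while a different starting point reaches the best configuration. This is the failure mode that makes shaping one component at a time unreliable.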

Theorems & Definitions (3)

Definition 2.1: Test Environment
Definition 2.2: Reference Environment
Definition 2.3: Shaped Environment
