Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning

Archit Sharma, Ahmed M. Ahmed, Rehaan Ahmad, Chelsea Finn

TL;DR

MEDAL++ enables autonomous visuomotor reinforcement learning from RGB vision with minimal supervision by learning a forward policy to perform tasks and a backward policy to undo them, while inferring the reward online via VICE. It improves data efficiency through an ensemble of Q-networks, demonstration oversampling, and BC-regularized policy updates, and it trains end-to-end without explicit state estimation. In both simulated EARL benchmarks and real-robot experiments on a Franka Panda, MEDAL++ achieves substantially higher success rates than behavior cloning, demonstrating a practical step toward self-improving robots, though data collection speed and reset strategies remain challenges for broader deployment.
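
Demonstration oversampling, mentioned above, can be illustrated with a minimal sketch: each training batch reserves a fixed share of its slots for expert transitions, regardless of how large the online buffer has grown. The function name `sample_batch`, the `demo_fraction` parameter, and its default value are assumptions made for this illustration and are not taken from the paper.

```python
import random


def sample_batch(demo_buffer, online_buffer, batch_size, demo_fraction=0.5, rng=random):
    """Draw a batch in which a fixed share of transitions comes from demonstrations.

    demo_fraction is a hypothetical hyperparameter; the value used by the
    authors is not stated in this summary.
    """
    if not online_buffer:
        # Before any online data exists, fall back to demonstrations only.
        n_demo = batch_size
    else:
        n_demo = int(batch_size * demo_fraction)
    n_online = batch_size - n_demo

    # Sample with replacement so that a small demo buffer can still fill its share.
    demo_part = [rng.choice(demo_buffer) for _ in range(n_demo)]
    online_part = [rng.choice(online_buffer) for _ in range(n_online)]
    return demo_part + online_part


# Toy buffers: 5 expert transitions versus 1000 self-collected ones.
demos = [("demo", i) for i in range(5)]
online = [("online", i) for i in range(1000)]
batch = sample_batch(demos, online, batch_size=8, demo_fraction=0.25)
```

With uniform sampling over the combined buffers, the five expert transitions would almost never appear in a batch of eight; fixing their share keeps them in every update.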

Abstract

In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on. An aspirational goal is to construct self-improving robots: robots that can learn and improve on their own, from autonomous interaction with minimal human supervision or oversight. Such robots could collect and train on much larger datasets, and thus learn more robust and performant policies. While reinforcement learning offers a framework for such autonomous learning via trial-and-error, practical realizations end up requiring extensive human supervision for reward function design and repeated resetting of the environment between episodes of interaction. In this work, we propose MEDAL++, a novel design for self-improving robotic systems: given a small set of expert demonstrations at the start, the robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations. The policy and reward function are learned end-to-end from high-dimensional visual inputs, bypassing the need for explicit state estimation or task-specific pre-training for visual encoders used in prior work. We first evaluate our proposed algorithm on the simulated non-episodic benchmark EARL, finding that MEDAL++ is both more data efficient and achieves up to 30% better final performance compared to state-of-the-art vision-based methods. Our real-robot experiments show that MEDAL++ can be applied to manipulation problems in larger environments than those considered in prior work, and autonomous self-improvement can improve the success rate by 30-70% over behavior cloning on just the expert data. Code, training and evaluation videos, along with a brief overview, are available at: https://architsharma97.github.io/self-improving-robots/
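
The idea of practicing by alternately doing and undoing the task can be sketched as a control loop that never resets the environment. The sketch below is a toy under stated assumptions: the 1-D environment, the hand-written policies, and the fixed switching interval `switch_every` are all hypothetical stand-ins, and no learning takes place. In MEDAL++ both policies are learned from images.

```python
def run_autonomous_practice(env_step, state, forward_policy, backward_policy,
                            total_steps, switch_every):
    """Alternate forward and backward policies without ever resetting the environment."""
    log = []
    for t in range(total_steps):
        # Hypothetical switching rule: hand over control every `switch_every` steps.
        use_forward = (t // switch_every) % 2 == 0
        policy = forward_policy if use_forward else backward_policy
        action = policy(state)
        state = env_step(state, action)  # the state carries over; no reset call
        log.append(("forward" if use_forward else "backward", state))
    return state, log


# Toy 1-D world: positions 0..5, where 0 is the initial state and 5 is the goal.
GOAL, START = 5, 0


def toy_env_step(state, action):
    return min(max(state + action, START), GOAL)


def toy_forward_policy(state):
    return 1 if state < GOAL else 0    # "do" the task: move toward the goal


def toy_backward_policy(state):
    return -1 if state > START else 0  # "undo" the task: return to the start


final_state, log = run_autonomous_practice(
    toy_env_step, START, toy_forward_policy, toy_backward_policy,
    total_steps=20, switch_every=5,
)
```

Because the backward policy returns the system to the initial state, the forward policy gets a fresh attempt every cycle without a human resetting the scene.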

Paper Structure (15 sections, 3 equations, 10 figures)

Figures (10)

  • Figure 1: A robot resets the environment from the goal state to the initial state (top), in contrast to a human resetting the environment for the robot (bottom). While the latter is the norm in robotic reinforcement learning, a robot that can reset the environment and practice the task autonomously can train on more data and thus become more competent.
  • Figure 2: Visualizing the positive target states for the forward classifier $C_f$ and the backward classifier $C_b$ from the expert demonstrations. For forward demonstrations, the last $K$ states are used for $C_f$ (orange) and the rest are used for $C_b$ (pink). For backward demonstrations, the last $K$ states are used for $C_b$.
  • Figure 3: An overview of MEDAL++ training. The classifier is trained to discriminate states visited by an expert from states visited online. The robot performs reinforcement learning on a combination of self-collected and expert transitions, and policy learning is regularized with a behavior cloning loss.
  • Figure 4: Comparison of autonomous RL methods on vision-based manipulation tasks in simulated environments from the EARL benchmark (Sharma et al.). MEDAL++ is more data efficient and learns a policy that is as successful as or more successful than those of other methods.
  • Figure 5: An overview of MEDAL++ on the task of inserting the peg into the goal location. (top) Starting with a set of expert trajectories, MEDAL++ learns a forward policy to insert the peg by matching the goal states and a backward policy to remove the peg and randomize its position by matching the rest of the states visited by an expert. (bottom) Chaining the rollouts of the forward and backward policies allows the robot to practice the task autonomously. The rewards indicate the similarity to the respective target states, as output by a discriminator trained to distinguish online states from expert states.
  • ...and 5 more figures
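
The assignment of positive target states described in the Figure 2 caption can be sketched in a few lines. This is an illustration only: the function name `split_positive_states` and the use of plain lists as trajectories are assumptions, and real trajectories would hold image observations rather than integers.

```python
def split_positive_states(forward_demos, backward_demos, k):
    """Build the positive sets for the forward classifier C_f and backward classifier C_b.

    Forward demonstrations: the last k states go to C_f, the rest go to C_b.
    Backward demonstrations: only the last k states go to C_b.
    """
    forward_pos, backward_pos = [], []
    for traj in forward_demos:
        cut = max(len(traj) - k, 0)  # index where the last k states begin
        forward_pos.extend(traj[cut:])
        backward_pos.extend(traj[:cut])
    for traj in backward_demos:
        cut = max(len(traj) - k, 0)
        backward_pos.extend(traj[cut:])
    return forward_pos, backward_pos


# Toy example with integer "states" standing in for image observations.
forward_demos = [[0, 1, 2, 3, 4]]
backward_demos = [[10, 11, 12]]
c_f_pos, c_b_pos = split_positive_states(forward_demos, backward_demos, k=2)
# c_f_pos == [3, 4]
# c_b_pos == [0, 1, 2, 11, 12]
```

Under this split, the forward classifier rewards reaching states that look like the end of a successful attempt, while the backward classifier rewards returning to the broader set of states an expert starts from or passes through.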