Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators

Alexander Herzog, Kanishka Rao, Karol Hausman, Yao Lu, Paul Wohlhart, Mengyuan Yan, Jessica Lin, Montserrat Gonzalez Arenas, Ted Xiao, Daniel Kappler, Daniel Ho, Jarek Rettinghouse, Yevgen Chebotar, Kuang-Huei Lee, Keerthana Gopalakrishnan, Ryan Julian, Adrian Li, Chuyuan Kelly Fu, Bob Wei, Sangeetha Ramesh, Khem Holden, Kim Kleiven, David Rendleman, Sean Kirmani, Jeff Bingham, Jon Weisz, Ying Xu, Wenlong Lu, Matthew Bennice, Cody Fong, David Do, Jessica Lam, Yunfei Bai, Benjie Holson, Michael Quinlan, Noah Brown, Mrinal Kalakrishnan, Julian Ibarz, Peter Pastor, Sergey Levine

TL;DR

The paper addresses scalable, real-world robotic manipulation via end-to-end deep RL for office waste sorting. It introduces RL@Scale (RLS), a framework that bootstraps from simulation, learns from a fleet of real robots, and uses pretrained object masks to improve generalization, all coordinated through a data-flywheel training loop using PI-QT-Opt. Empirical results from 23 robots across three buildings show that more real-world data and architectural choices (memory, masking, sim-to-real transfer) substantially improve in-distribution and held-out performance, achieving 84.35% sorting success and meaningful contamination reduction in deployment. The work provides a practical blueprint for deploying learning-enabled robotic manipulation at scale, including safety-aware autonomous data collection, multi-task curriculum bootstrapping, and integration of computer-vision priors with end-to-end control.

Abstract

We describe a system for deep reinforcement learning of robotic manipulation skills applied to a large-scale real-world task: sorting recyclables and trash in office buildings. Real-world deployment of deep RL policies requires not only effective training algorithms, but also the ability to bootstrap real-world training and enable broad generalization. To this end, our system combines scalable deep RL from real-world data with bootstrapping from training in simulation, and incorporates auxiliary inputs from existing computer vision systems as a way to boost generalization to novel objects, while retaining the benefits of end-to-end training. We analyze the tradeoffs of different design decisions in our system, and present a large-scale empirical validation that includes training on real-world data gathered over the course of 24 months of experimentation, across a fleet of 23 robots in three office buildings, with a total training set of 9527 hours of robotic experience. Our final validation also consists of 4800 evaluation trials across 240 waste station configurations, in order to evaluate in detail the impact of the design decisions in our system, the scaling effects of including more real-world data, and the performance of the method on novel objects. The project's website and videos can be found at http://rl-at-scale.github.io.

Paper Structure

This paper contains 30 sections, 1 equation, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: Overview of our data flywheels that we operated over 24 months: We bootstrap the initial policy from scripts in simulation and on real robots (grey), re-train the policy in simulation as needed (green), deploy the latest policy weekly to a local setup of 20 robots sorting 20 waste stations on random waste-scenes and scenes encountered in the deployment site (blue), and deploy to 23 robots operating in 3 different buildings sorting 30 waste stations (red).
  • Figure 2: The experimental platform. left: Our mobile manipulator with a 7 degree-of-freedom (DoF) arm and a parallel jaw gripper. right: The sorting task demonstrated by an example: A compostable food container (red box) is misplaced in the landfill tray. Once the robot arrives at its initial state in front of the waste station with the arm above the station, it executes a trained or scripted policy that identifies misplaced objects and moves them to the correct bin. In the case of this example, the robot would receive a reward for moving the food container into the compost tray (green box).
  • Figure 3: An overview of the network architecture. We encode RGB camera images and unsorted object masks with convolutional layers. Our Q-function considers visual observations from the most recent 6 time steps, but the predictive information (CEB) auxiliary loss only considers the current image $o^{v}_{t}$ for the past $X$ and the next image $o^{v}_{t+1}$ for the future $Y$, in order to avoid information overlap between $X$ and $Y$.
  • Figure 4: Waste scenarios used for evaluations. The top 3 rows show the 9 in-distribution scenarios. The bottom row shows the held-out scenes, which contain objects not previously seen in either the real world or simulation, such as the keyboard, banana and face mask.
  • Figure 5: The robot classroom, a controlled setting for repeatable evaluations. 20 robots continuously collect data at 20 waste stations.
  • ...and 9 more figures
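
The observation windowing described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the class `ObservationHistory`, the function `predictive_info_pairs`, and the use of placeholder strings as frames are all hypothetical, introduced only for illustration. The two facts taken from the caption are the history length of 6 time steps for the Q-function and the single-frame pairing of $o^{v}_{t}$ (past $X$) with $o^{v}_{t+1}$ (future $Y$) for the predictive information auxiliary.

```python
from collections import deque

# From the Figure 3 caption: the Q-function sees the 6 most recent time steps.
HISTORY_LEN = 6


class ObservationHistory:
    """Hypothetical helper that keeps only the most recent visual observations."""

    def __init__(self, history_len=HISTORY_LEN):
        # A bounded deque silently drops the oldest frame once it is full.
        self._frames = deque(maxlen=history_len)

    def append(self, frame):
        self._frames.append(frame)

    def q_function_input(self):
        """Return the buffered frames, oldest first (at most history_len)."""
        return list(self._frames)


def predictive_info_pairs(frames):
    """Pair each frame o_t (past X) with its successor o_{t+1} (future Y).

    X and Y are single frames rather than multi-step windows, so no frame
    appears in both the X and the Y of the same pair.
    """
    return [(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]


# Placeholder frames; a real system would store image tensors instead.
history = ObservationHistory()
for step in range(8):
    history.append(f"o{step}")

q_input = history.q_function_input()
pairs = predictive_info_pairs(["o0", "o1", "o2", "o3"])
```

Using single frames for $X$ and $Y$ is what keeps the two sides of each pair disjoint; if both sides were 6-step windows taken at adjacent time steps, they would share five frames.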