Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Bryan Chen, Alexander Sax, Gene Lewis, Iro Armeni, Silvio Savarese, Amir Zamir, Jitendra Malik, Lerrel Pinto

TL;DR

This paper demonstrates that asynchronously trained mid-level visual representations, when frozen as perceptual inputs to reinforcement learning agents, substantially improve generalization and sample efficiency over end-to-end pixel policies. By evaluating on manipulation and navigation tasks, including zero-shot sim-to-real transfer, the study shows that mid-level features scale to harder problems and are more robust to domain shifts than domain randomization or learning-from-scratch. The findings indicate that mid-level representations align training and test distributions, simplify the learning problem, and support faster, more reliable policy learning with real-world applicability. Overall, mid-level vision offers a practical and scalable path to robust visuomotor control in robotics.

Abstract

Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training. We study the effects of using mid-level visual representations (features learned asynchronously for traditional computer vision objectives), as a generic and easy-to-decode perceptual state in an end-to-end RL framework. Mid-level representations encode invariances about the world, and we show that they aid generalization, improve sample complexity, and lead to a higher final performance. Compared to other approaches for incorporating invariances, such as domain randomization, asynchronously trained mid-level representations scale better: both to harder problems and to larger domain shifts. In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed. We report results on both manipulation and navigation tasks, and for navigation include zero-shot sim-to-real experiments on real robots.
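The claim that mid-level representations "encode invariances about the world" can be illustrated with a toy sketch. It is not the paper's method: the encoders studied there are neural networks trained on computer vision objectives, whereas `frozen_edge_encoder` below is a hypothetical hand-written stand-in (horizontal finite differences), and `linear_policy`, `shift_brightness`, and the weights are likewise made up for illustration. The sketch only shows how a fixed feature map that is invariant to a nuisance factor (here, a global brightness offset) makes the policy's input, and therefore its output, identical across a train/test shift that changes the raw pixels.

```python
from typing import List

Image = List[List[float]]


def frozen_edge_encoder(image: Image) -> List[float]:
    """Hypothetical stand-in for a frozen mid-level encoder.

    Computes horizontal finite differences, which cancel any constant
    brightness offset. The function has no trainable parameters, mirroring
    the idea of an encoder that stays fixed while the policy is trained.
    """
    return [row[i + 1] - row[i] for row in image for i in range(len(row) - 1)]


def flatten(image: Image) -> List[float]:
    """Raw-pixel baseline: the policy sees the image as-is."""
    return [pixel for row in image for pixel in row]


def linear_policy(features: List[float], weights: List[float]) -> float:
    """Toy policy head: a single linear score over its input features."""
    return sum(w * f for w, f in zip(weights, features))


def shift_brightness(image: Image, offset: float) -> Image:
    """Simulated domain shift: add a constant offset to every pixel."""
    return [[pixel + offset for pixel in row] for row in image]


# A 2x3 "training" observation and a brightness-shifted "test" observation.
train_obs: Image = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
test_obs: Image = shift_brightness(train_obs, 0.5)

# Arbitrary illustrative weights (assumed, not learned).
pixel_weights = [1.0] * 6    # one weight per raw pixel
feature_weights = [1.0] * 4  # one weight per edge feature

# How much the policy output moves between train and test observations.
pixel_gap = abs(
    linear_policy(flatten(train_obs), pixel_weights)
    - linear_policy(flatten(test_obs), pixel_weights)
)
feature_gap = abs(
    linear_policy(frozen_edge_encoder(train_obs), feature_weights)
    - linear_policy(frozen_edge_encoder(test_obs), feature_weights)
)

print("output change on raw pixels:   ", pixel_gap)
print("output change on edge features:", feature_gap)
```

In this toy setting the pixel-based policy's output changes under the shift while the feature-based policy's output does not. Real mid-level features are only approximately invariant, and which invariances help depends on the downstream task.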

Paper Structure

This paper contains 14 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Mid-level visual representations used for RL. Left: A feature encoder trained for some mid-level objective provides representations to the agent. Right: Agents trained using these mid-level representations were able to generalize, without additional adaptation, to distribution shifts and deployment on physical robots.
  • Figure 2: Mid-level representations transform pixel inputs. There are two major ways that invariances from mid-level representations could be useful for downstream tasks. (I) Left: The invariances simplify the decision boundaries for downstream tasks. In this case, we would expect (i) training on representations to be faster than training on pixels and (ii) to allow us to train agents for more difficult tasks. (II) Right: The invariances in representation space align the train and test distributions. In this case we would expect generalization performance to improve relative to training on pixels. In practice, we see behavior consistent with both hypotheses.
  • Figure 3: Sample labels for mid-level visual objectives in the RLBench environment. The objectives cover various modes of computer vision tasks including 2D, 3D, and semantic tasks.
  • Figure 4: Train vs. test observations in RLBench. Left: The default environment for the Pick + Place task. The goal is colored green and the object red. Right: Sample observation from the test environment showing held-out randomized textures.
  • Figure 5: Zero-Shot Visual Sim-to-Real. Right: We train policies in a single building in simulation. Middle: We then directly test them in novel real-world buildings, where agents have no prior knowledge of the building and no adaptation period. Left: The TurtleBot uses only an RGB camera for vision and an IMU for localization. No depth/LiDAR sensors are used.
  • ...and 4 more figures