Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation

Peter R. Florence, Lucas Manuelli, Russ Tedrake

TL;DR

Dense Object Nets address the need for a task-agnostic, dense object representation suitable for manipulation by learning pixelwise descriptors through self-supervision. The method combines a pixelwise contrastive loss with object-centric data practices, 3D change-detection masks, and three multi-object training strategies to produce descriptors that are consistent across viewpoint, deformation, and even object classes. Key contributions include rapid, self-supervised training for many objects, cross-object loss for distinct object separation, and demonstrations of grasping specific points across deformations and transferring grasps within a class. This work offers a scalable, practical approach to dense visual understanding in robotic manipulation with potential impact on general-purpose manipulation and object-centric learning.

Abstract

What is the right object representation for manipulation? We would like robots to visually perceive scenes and learn an understanding of the objects in them that (i) is task-agnostic and can be used as a building block for a variety of manipulation tasks, (ii) is generally applicable to both rigid and non-rigid objects, (iii) takes advantage of the strong priors provided by 3D vision, and (iv) is entirely learned from self-supervision. This is hard to achieve with previous methods: much recent work in grasping does not extend to grasping specific objects or other tasks, whereas task-specific learning may require many trials to generalize well across object configurations or other tasks. In this paper we present Dense Object Nets, which build on recent developments in self-supervised dense descriptor learning, as a consistent object representation for visual understanding and manipulation. We demonstrate they can be trained quickly (approximately 20 minutes) for a wide variety of previously unseen and potentially non-rigid objects. We additionally present novel contributions to enable multi-object descriptor learning, and show that by modifying our training procedure, we can either acquire descriptors that generalize across classes of objects, or descriptors that are distinct for each object instance. Finally, we demonstrate the novel application of learned dense descriptors to robotic manipulation. We demonstrate grasping of specific points on an object across potentially deformed object configurations, and demonstrate using class-general descriptors to transfer specific grasps across objects in a class.

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures.

Figures (7)

  • Figure 1: Overview of the data collection and training procedure. (a) automated collection with a robot arm. (b) change detection using the dense 3D reconstruction. (c)-(f) matches depicted in green, non-matches depicted in red.
  • Figure 2: Learned object descriptors can be consistent across significant deformation (a) and, if desired, across object classes (b-d). Shown for each of (a) and (b-d) are RGB frames (top) and corresponding descriptor images (bottom) that are the direct output of a feed-forward pass through a trained network. (e)-(f) show that we can learn descriptors for low-texture objects, with the descriptors masked for clear visualization. Our object set is also summarized (right).
  • Figure 3: (a) Table describing the different types of networks referenced in experiments. Column labels correspond to techniques described in the methodology section. (b) Plots the cdf of the L2 pixel distance (normalized by the image diagonal, 800 for a 640 × 480 image) between the best match $\hat{u}_b$ and the true match $u_b^*$; e.g., for standard-SO, in $93\%$ of image pairs the normalized pixel distance between $u_b^*$ and $\hat{u}_b$ is less than $13\%$. All networks were trained on the same dataset using the labeled training procedure from (a). (c) Plots the cdf of the fraction of object pixels $u_b$ with $D(I_a, u_a^*, I_b, u_b) < D(I_a, u_a^*, I_b, u_b^*)$, i.e. pixels that are closer in descriptor space to $u_a^*$ than the true match $u_b^*$ is.
  • Figure 4: Comparison of training without any distinct object loss (a) vs. using cross-object loss (b). In (b), 50% of training iterations applied cross-object loss and 50% applied single-object within-scene loss, whereas (a) is 100% single-object within-scene loss. The plots show a scatter of the descriptors for 10,000 randomly selected pixels for each of three distinct objects. Networks were trained with $D=2$ to allow direct cluster visualization. (c) Same axes as Figure 3 (a). All networks were trained on the same three-object dataset. Networks with a number label were trained with cross-object loss, and the number denotes the descriptor dimension. no-cross-object is a network trained without cross-object loss.
  • Figure 5: (a), with the same axes as Figure 3 (a), compares standard-SO with without-DR, for which the only difference is that without-DR used no background domain randomization during training. The dataset used for (a) is of three objects, 4 scenes each. (b) shows that for a dataset containing 10 scenes of a drill, learned descriptors are inconsistent without background and orientation randomization during training (middle), but consistent with them (right).
  • ...and 2 more figures
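
The two evaluation quantities described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that $D$ is the Euclidean distance between descriptor vectors and that the best match $\hat{u}_b$ is the nearest neighbor in descriptor space; neither assumption is stated in the captions above. All function names and the tiny descriptor image are hypothetical, chosen only for illustration, and this is not the authors' evaluation code.

```python
import math

def descriptor_distance(d1, d2):
    """Euclidean distance between two descriptor vectors (assumed form of D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d1, d2)))

def best_match(query_desc, desc_image_b):
    """Pixel (u, v) in image b whose descriptor is nearest to query_desc.

    desc_image_b is a nested list indexed as [v][u] -> descriptor vector.
    """
    best_uv, best_dist = None, math.inf
    for v, row in enumerate(desc_image_b):
        for u, desc in enumerate(row):
            dist = descriptor_distance(query_desc, desc)
            if dist < best_dist:
                best_uv, best_dist = (u, v), dist
    return best_uv

def normalized_pixel_error(u_hat, u_true, width, height):
    """L2 pixel distance between two pixels, divided by the image diagonal."""
    diagonal = math.hypot(width, height)  # 800 for a 640 x 480 image
    return math.hypot(u_hat[0] - u_true[0], u_hat[1] - u_true[1]) / diagonal

def fraction_closer_than_true(query_desc, desc_image_b, u_true, object_mask=None):
    """Fraction of (object) pixels in image b that are strictly closer to
    query_desc in descriptor space than the true match is."""
    true_dist = descriptor_distance(query_desc, desc_image_b[u_true[1]][u_true[0]])
    closer = total = 0
    for v, row in enumerate(desc_image_b):
        for u, desc in enumerate(row):
            if object_mask is not None and not object_mask[v][u]:
                continue  # skip background pixels when a mask is given
            total += 1
            if descriptor_distance(query_desc, desc) < true_dist:
                closer += 1
    return closer / total if total else 0.0

# Hypothetical 3 x 2 descriptor image with 2-dimensional descriptors.
desc_b = [
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]],
]
query = [1.8, 0.0]      # descriptor of u_a^* in image a
u_true = (1, 0)         # ground-truth match u_b^* in image b

u_hat = best_match(query, desc_b)
error = normalized_pixel_error(u_hat, u_true, width=3, height=2)
fraction = fraction_closer_than_true(query, desc_b, u_true)
print(u_hat, error, fraction)
```

Collecting `error` over many image pairs and plotting its empirical cdf would give a curve of the kind described for panel (b); doing the same for `fraction` corresponds to panel (c). A real evaluation would use array operations over full-resolution descriptor images rather than nested Python loops.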