SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control

Arunkumar Byravan, Felix Leeb, Franziska Meier, Dieter Fox

TL;DR

SE3-Pose-Nets introduce a structured deep dynamics framework that decomposes scenes into moving parts with 6D poses and learns to predict part-wise pose changes under actions. The model enables end-to-end training with minimal supervision and supports closed-loop control by planning directly in the learned pose space, achieving real-time reactive control on a Baxter robot from raw depth data. Key contributions include explicit data association through a pose space, a three-part network architecture (structure, dynamics, transform), and gradient-based control in pose space, demonstrated in both simulation and real-world Baxter experiments. The work advances visuomotor control by combining structured scene understanding with model-based planning in a learned latent space, offering efficient control and robust data association without external trackers.

Abstract

In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose embedding, along with their motion, which is modeled as a change in the pose space due to the applied actions. We train our model using a pair of point clouds separated by an action and show that, given supervision only in the form of point-wise data associations between the frames, our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world and compare against two baseline deep networks. Our method runs in real time, achieves good prediction of scene dynamics, and outperforms the baseline methods on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/
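
The closed-loop control idea in the abstract (compute actions by gradient-based minimization of an error in pose space) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the linear `transition` function is a hypothetical stand-in for the learned transition model, `B` is a made-up matrix, and the step sizes, clipping, and stopping rule are arbitrary. It is not the authors' controller, only a minimal instance of the general technique.

```python
import numpy as np

def transition(p, u, B):
    """Hypothetical stand-in for a learned transition model: next pose given action u."""
    return p + B @ u

def plan_action(p, p_target, B, steps=50, lr=0.1):
    """Gradient descent on u for E(u) = 0.5 * ||transition(p, u) - p_target||^2."""
    u = np.zeros(B.shape[1])
    for _ in range(steps):
        err = transition(p, u, B) - p_target
        grad = B.T @ err  # analytic dE/du for the linear stand-in model
        u -= lr * grad
    return u

def control_loop(p0, p_target, B, max_iters=100, tol=1e-3, max_step=0.1):
    """Replan at every step and apply a norm-limited action (closed loop)."""
    p = np.asarray(p0, dtype=float).copy()
    errors = [np.linalg.norm(p - p_target)]
    for _ in range(max_iters):
        if errors[-1] < tol:
            break
        u = plan_action(p, p_target, B)
        norm = np.linalg.norm(u)
        if norm > max_step:
            u = u * (max_step / norm)  # limit the size of each applied action
        # A real system would execute u, observe a new point cloud and re-encode it;
        # here the stand-in model itself plays the role of the environment.
        p = transition(p, u, B)
        errors.append(np.linalg.norm(p - p_target))
    return p, errors

B = np.array([[1.0, 0.2], [0.0, 1.0]])  # made-up pose-change-per-action matrix
p_final, errors = control_loop(np.zeros(2), np.array([1.0, -0.5]), B)
print(len(errors) - 1, "control steps, final pose error:", errors[-1])
```

With a learned, nonlinear transition model the analytic gradient would be replaced by backpropagation through the network, but the loop structure (encode, plan by gradient descent, act, re-observe) stays the same.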

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: An example scenario showing the initial (left) and target point cloud (right). SE3-Pose-Nets can be used to control the robot to reach the target state based only on raw depth data. Depth images colorized for display purposes only.
  • Figure 2: Top: SE3-Pose-Net architecture consisting of three components: the encoder ($h_{enc}$, shown in blue) that predicts dense segmentation masks ($\mathbf{m}$) and 6D poses ($\mathbf{p}$), a pose transition net ($h_{trans}$) that models the change in the pose space ($\Delta{\mathbf{p}}$) as an effect of the applied action ($\mathbf{u}$), and the transform layer that applies these pose changes to the current point cloud to generate a predicted point cloud ($\hat{\mathbf{x}}$). Bottom Left: Graph showing the procedure for training the SE3-Pose-Net along with two loss functions: a 3D loss on the predicted point cloud ($L_x$) and a pose consistency loss ($L_p$) relating the "next" poses predicted by the transition net ($\hat{\mathbf{p}}_{t+1}$) and the encoder ($\mathbf{p}_{t+1}$). Bottom Right: Control using the SE3-Pose-Net. Given a target point cloud ($\mathbf{x}_T$) encoded as poses ($\mathbf{p}_T$) through the learned encoder, we use the learned transition model ($h_{trans}$) to plan a sequence of actions $\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_T$ by minimizing error ($E$) directly in the pose space, starting from the poses $\mathbf{p}_0$ of the initial point cloud.
  • Figure 3: Masks generated by different networks on simulated (top) and real data (bottom). From left to right: Ground truth depth, ground truth masks, masks predicted by the SE3-Pose-Net, SE3-Pose-Net with joint angles, SE3-Net and SE3-Net with joint angles.
  • Figure 4: Convergence of joint angle error in simulated Baxter control tasks: (left) without joint angles, (middle) without joint angles and with the detected failure case removed (for all methods), (right) with joint angles. SE3-Pose-Nets perform as well as or better than the baseline methods even though the baseline models have additional information in the form of ground-truth associations.
  • Figure 5: Convergence of joint angle error on real Baxter control tasks: (left) without joint angles, (right) with joint angles (averaged across joints 0, 1, 2, and 3).
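
The transform layer and the two training losses named in the Figure 2 caption can be sketched roughly as follows. The captions do not give the exact formulas, so this is a sketch under stated assumptions: each part's motion is taken to be a rigid transform (rotation `R_k`, translation `t_k`), the predicted point is taken to be the mask-weighted blend of the transformed points, and both losses are taken to be plain mean squared errors. The function names and the flat pose representation are hypothetical.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def transform_points(x, masks, rotations, translations):
    """Blend K rigid transforms per point (assumed form of a transform layer).

    x: (N, 3) points; masks: (K, N) weights summing to 1 over K;
    rotations: (K, 3, 3); translations: (K, 3). Returns (N, 3).
    """
    # moved[k, n] = R_k @ x[n] + t_k
    moved = np.einsum("kij,nj->kni", rotations, x) + translations[:, None, :]
    return np.sum(masks[:, :, None] * moved, axis=0)

def point_loss(x_pred, x_target):
    """Assumed form of L_x: mean squared 3D distance between associated points."""
    return np.mean(np.sum((x_pred - x_target) ** 2, axis=1))

def pose_consistency_loss(p_pred, p_enc):
    """Assumed form of L_p: mean squared difference between two pose estimates."""
    return np.mean((p_pred - p_enc) ** 2)

# Two points: the first belongs to a static background, the second to a moving part.
x = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
masks = np.array([[1.0, 0.0], [0.0, 1.0]])
rotations = np.stack([np.eye(3), rot_z(np.pi / 2)])
translations = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
x_pred = transform_points(x, masks, rotations, translations)
print(x_pred)
```

In this toy case the masks are hard (0 or 1); soft masks would blend the transformed positions of each point across parts.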