SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control
Arunkumar Byravan, Felix Leeb, Franziska Meier, Dieter Fox
TL;DR
SE3-Pose-Nets is a structured deep dynamics framework that decomposes a scene into moving parts, assigns each part a 6D pose, and learns to predict how those poses change under applied actions. The model is trained end-to-end, supervised only by point-wise data associations between frames, and supports closed-loop control by planning directly in the learned pose space, achieving real-time reactive control of a Baxter robot from raw depth data. The key contributions are explicit data association through a pose space, a three-part network architecture (structure, dynamics, transform), and gradient-based control in pose space, demonstrated both in simulation and on a real Baxter robot. By combining structured scene understanding with model-based planning in a learned low-dimensional space, the approach offers efficient control and robust data association without external trackers.
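To make the idea of part-wise motion concrete, here is a minimal sketch of how a point cloud could be moved by per-part rigid transforms blended with soft segmentation masks. It assumes the mask-weighted formulation commonly used with SE3-Nets-style models (each predicted point is the mask-weighted sum of the point under every part's transform). The function name `blend_rigid_transforms` and all array shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def blend_rigid_transforms(points, masks, rotations, translations):
    """Move each point by a mask-weighted blend of K rigid transforms.

    points:       (N, 3) input point cloud
    masks:        (K, N) soft part assignment, assumed to sum to 1 over K
    rotations:    (K, 3, 3) one rotation matrix per part
    translations: (K, 3) one translation per part
    """
    # moved[k, n] = R_k @ points[n] + t_k, i.e. every point under every part's motion
    moved = np.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    # Weight each candidate position by how strongly the point belongs to that part
    return np.einsum("kn,kni->ni", masks, moved)

# Hypothetical two-part scene: part 0 is static, part 1 shifts by +1 along x.
points = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
masks = np.array([[1.0, 0.0], [0.0, 1.0]])  # hard assignment for clarity
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

predicted = blend_rigid_transforms(points, masks, rotations, translations)
# The first point stays at the origin; the second moves to [2, 1, 1].
```

With soft masks, a point on the boundary between two parts ends up at a weighted average of the two candidate positions, which keeps the operation differentiable with respect to the masks and the transforms.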
Abstract
In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose embedding, along with their motion, which is modeled as a change in the pose space due to the applied actions. We train our model on pairs of point clouds separated by an action and show that, given supervision only in the form of point-wise data associations between the frames, our network learns a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world, and compare against two baseline deep networks. Our method runs in real time, achieves good prediction of scene dynamics, and outperforms the baseline methods on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/
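The closed-loop control idea described in the abstract can be sketched as follows: at each step, take the current pose embedding, compute the gradient of the pose-space error with respect to the action, apply the resulting action, and re-observe. This is only a toy illustration under stated assumptions. The learned dynamics network is replaced by a hypothetical linear model `toy_predict_pose`, the gradient is computed by finite differences rather than any method from the paper, and the matrix `B`, the step size, and all function names are invented for the example.

```python
import numpy as np

def numerical_gradient(cost, u, eps=1e-5):
    """Central-difference gradient of a scalar cost with respect to the action u."""
    grad = np.zeros_like(u)
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = eps
        grad[i] = (cost(u + step) - cost(u - step)) / (2 * eps)
    return grad

def control_step(predict_pose, pose, target_pose, n_actions, lr):
    """One gradient step on the pose-space error, starting from a zero action."""
    def cost(u):
        return 0.5 * np.sum((predict_pose(pose, u) - target_pose) ** 2)
    return -lr * numerical_gradient(cost, np.zeros(n_actions))

# Assumption: a linear stand-in for the learned pose-space dynamics model.
B = np.array([[1.0, 0.0], [0.0, 0.5], [0.2, 0.0]])

def toy_predict_pose(pose, u):
    return pose + B @ u

def run_closed_loop(pose, target_pose, steps=100, lr=0.9):
    """Repeatedly compute an action, apply it, and re-observe the pose."""
    for _ in range(steps):
        u = control_step(toy_predict_pose, pose, target_pose, n_actions=2, lr=lr)
        # Here the "real" system is the same toy model; on a robot the new pose
        # would come from encoding the next depth observation.
        pose = toy_predict_pose(pose, u)
    return pose

start = np.zeros(3)
target = B @ np.array([0.4, -0.6])  # a target pose reachable under the toy model
final = run_closed_loop(start, target)
```

Because the action is recomputed from the freshly observed pose at every step, errors in the dynamics model do not accumulate over the run the way they would in an open-loop plan.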
