RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

Bo Ai, Stephen Tian, Haochen Shi, Yixuan Wang, Cheston Tan, Yunzhu Li, Jiajun Wu

TL;DR

RoboPack addresses dense packing and non-prehensile manipulation under partial observability by learning tactile-informed dynamics from visuo-tactile history. It combines a tactile-aware state estimator with a graph-based dynamics predictor and model-predictive control, using a latent physics vector per object to capture properties that vision cannot observe directly. Trained on an average of 30 minutes of real-world interaction data per task, RoboPack significantly outperforms vision-only and physics-based baselines in long-horizon prediction and real-robot planning, and its learned latent physics vectors show structure that correlates with object properties. The approach advances practical manipulation by enabling online adaptation to unknown object properties and occluded scenes through multi-modal perception and learned dynamics.
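To make the planning loop concrete, here is a minimal sketch of sampling-based model-predictive control over a learned dynamics model. It assumes a hypothetical `predict(particles, action)` function standing in for the learned dynamics predictor (here a toy rule that shifts every particle by the action), a toy cost (mean distance of particles to a goal point), and plain random shooting; the planner and cost actually used by RoboPack may differ.

```python
import numpy as np


def predict(particles: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the learned dynamics predictor.

    Toy rule (assumption): every particle is displaced by the 3D push vector.
    """
    return particles + action


def rollout_cost(particles: np.ndarray, actions: np.ndarray, goal: np.ndarray) -> float:
    """Roll the model forward over an action sequence and score the final state."""
    state = particles
    for a in actions:
        state = predict(state, a)
    # Toy cost: mean Euclidean distance of the particles to the goal point.
    return float(np.linalg.norm(state - goal, axis=1).mean())


def plan(particles, goal, horizon=5, num_samples=256, max_step=0.02, seed=0):
    """Random-shooting MPC: sample action sequences, keep the cheapest one,
    and return only its first action (the robot executes it, then replans)."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-max_step, max_step, size=(num_samples, horizon, 3))
    costs = np.array([rollout_cost(particles, seq, goal) for seq in candidates])
    return candidates[np.argmin(costs)][0]


if __name__ == "__main__":
    particles = np.zeros((10, 3))          # toy object: 10 particles at the origin
    goal = np.array([0.1, 0.0, 0.0])       # push the object 10 cm along x
    print(plan(particles, goal))
```

Executing only the first action and replanning at every step is what lets the controller react when tactile feedback updates the estimated state or latent physics.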

Abstract

Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks, where the robot must infer the physics properties of objects from direct and indirect interactions. Trained on only an average of 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations in both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems.
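The recurrent graph-based state estimator described above can be illustrated with a minimal NumPy sketch: at each time step, one round of message passing over a radius graph of particles, mean-pooling of node embeddings per object, and an LSTM update that carries each object's hidden state across the visuo-tactile history before a linear readout to a latent physics vector. The weights are random, the sizes are arbitrary, and every name here (`radius_edges`, `message_pass`, `LSTMCell`, `estimate_latents`) is hypothetical; this is not RoboPack's actual architecture or code.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, D_LAT = 4, 8, 3  # node-feature, hidden, latent-physics sizes (assumed)


def _sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def radius_edges(pos, radius):
    """Directed edges (src, dst) between distinct particles closer than `radius`."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    return np.nonzero((d < radius) & ~np.eye(len(pos), dtype=bool))


def message_pass(feat, pos, radius, w_msg, w_upd):
    """One round of message passing: sum incoming messages, then update each node."""
    src, dst = radius_edges(pos, radius)
    msgs = np.tanh(feat[src] @ w_msg)            # message sent along each edge
    agg = np.zeros((len(feat), w_msg.shape[1]))
    np.add.at(agg, dst, msgs)                    # sum messages per receiving particle
    return np.tanh(np.concatenate([feat, agg], axis=1) @ w_upd)


class LSTMCell:
    """Minimal LSTM cell operating on a batch of per-object vectors."""

    def __init__(self, d_in, d_hid, rng):
        self.w = rng.normal(scale=0.1, size=(d_in + d_hid, 4 * d_hid))
        self.b = np.zeros(4 * d_hid)

    def __call__(self, x, h, c):
        z = np.concatenate([x, h], axis=1) @ self.w + self.b
        i, f, o, g = np.split(z, 4, axis=1)      # input, forget, output gates, candidate
        c = _sigmoid(f) * c + _sigmoid(i) * np.tanh(g)
        h = _sigmoid(o) * np.tanh(c)
        return h, c


W_MSG = rng.normal(scale=0.1, size=(D_IN, D_HID))
W_UPD = rng.normal(scale=0.1, size=(D_IN + D_HID, D_HID))
W_LAT = rng.normal(scale=0.1, size=(D_HID, D_LAT))
CELL = LSTMCell(D_HID, D_HID, rng)


def estimate_latents(history, obj_ids, num_objects, radius=0.05):
    """Run the estimator over a list of (positions, features) pairs, one per time step,
    and return one latent physics vector per object."""
    h = np.zeros((num_objects, D_HID))
    c = np.zeros((num_objects, D_HID))
    counts = np.bincount(obj_ids, minlength=num_objects)[:, None]
    for pos, feat in history:
        node = message_pass(feat, pos, radius, W_MSG, W_UPD)
        pooled = np.zeros((num_objects, D_HID))
        np.add.at(pooled, obj_ids, node)
        pooled /= counts                         # mean over each object's particles
        h, c = CELL(pooled, h, c)                # recurrent update integrates the history
    return h @ W_LAT
```

Because the hidden state persists across steps, each new observation refines the latent estimate rather than replacing it, which is the property that enables online adaptation.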

Paper Structure (43 sections, 12 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Tactile sensing for dense packing. Tactile feedback is critical in tasks with heavy occlusion and rich contact, such as dense packing. (a) Humans rely on tactile sensations from their hands to navigate space and fit a water bottle into a suitcase. (b) Likewise, tactile sensing is crucial for robots to perform dense packing tasks, such as placing a can into a packed tray.
  • Figure 2: RoboPack's perception module. (a) We construct a trajectory comprising particle representations of the scene, maintaining correspondence via 3D point tracking on the point cloud data. (b) These particles facilitate the creation of a visual scene representation, denoted as $o^{vis}_t$. For points representing the Soft-Bubble grippers, tactile encodings $o^{tact}_t$ and latent physics vectors are integrated as extra attributes of the particles. We note that while the 3D point tracking module is needed at training time, during deployment the visual feedback can be replaced by predictions from our state estimator. This estimator auto-regressively predicts object particle positions from tactile interaction history and reduces reliance on dense visual feedback, which can be difficult to obtain due to visual occlusions.
  • Figure 3: RoboPack's dynamics module. We perform state estimation and dynamics reasoning with a state estimator and a dynamics predictor, respectively. (a) The state estimator auto-regressively predicts the positions of objects' particles and their latent physics vectors, reducing the dependency on dense visual feedback. (b) The dynamics predictor, conditioned on the estimated physics vectors, performs future prediction for planning. These modules share the same architecture, except that the state estimator has an LSTM that integrates history information and predicts a latent physics vector for each object.
  • Figure 4: Hardware overview. Our experimental platform consists of a Franka Panda arm, two Soft-Bubble sensors, four RealSense D415 RGB-D cameras, and a diverse set of objects.
  • Figure 5: Object sets for the packing task. The test objects are more complex than those in the training set visually, geometrically, and physically, to showcase the generalizability of our model.
  • ...and 7 more figures
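Figure 2 describes tactile encodings and latent physics vectors being attached to particles as extra attributes. A minimal sketch of that per-particle feature assembly follows, assuming that gripper particles are marked with object index -1, that particles without a tactile reading (or without an object latent) get zero padding, and that a binary flag distinguishes gripper from object particles. The function name, argument layout, and padding scheme are hypothetical illustrations, not RoboPack's actual data format.

```python
import numpy as np


def assemble_node_features(positions, object_ids, tactile, latents):
    """Build per-particle features: [xyz | is_gripper | tactile encoding | object latent].

    positions:  (N, 3)  particle positions from the visual scene representation
    object_ids: (N,)    object index per particle, or -1 for Soft-Bubble gripper particles
    tactile:    (G, Dt) tactile encodings, one row per gripper particle (in order)
    latents:    (K, Dl) latent physics vector per object
    """
    n = len(positions)
    is_gripper = object_ids < 0
    tact = np.zeros((n, tactile.shape[1]))
    tact[is_gripper] = tactile                   # tactile attributes only on gripper particles
    lat = np.zeros((n, latents.shape[1]))
    # Broadcast each object's latent physics vector to all of that object's particles.
    lat[~is_gripper] = latents[object_ids[~is_gripper]]
    flag = is_gripper.astype(float)[:, None]     # 1.0 for gripper, 0.0 for object particles
    return np.concatenate([positions, flag, tact, lat], axis=1)
```

Keeping one feature layout for every particle lets a single graph network process gripper and object nodes together, with the flag telling the network which attributes are meaningful.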