Visuo-Tactile World Models

Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, Franziska Meier

TL;DR

The paper tackles the limitations of vision-only world models in contact-rich manipulation by introducing a visuo-tactile world model (VT-WM) that grounds contact through tactile sensing. It combines RGB latent representations from a Cosmos encoder with tactile embeddings from Sparsh-X on Digit 360 sensors and uses a 12-layer transformer with spatio-temporal self-attention and cross-attention to actions to predict next latents $(s_{k+1}, t_{k+1})$. Training employs a mix of teacher forcing and sampling losses within a context of $9$ frames, and planning is performed with the cross-entropy method (CEM) in imagination, enabling zero-shot planning. Results show up to $33\%$ improvement in object permanence, $29\%$ improvement in causal compliance, and up to $35\%$ higher real-robot planning success in contact-rich tasks, plus $77\%$ success on a novel plate-in-rack task with only 20 demonstrations, highlighting practical impact for robust manipulation.
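As a rough illustration of planning with the cross-entropy method in imagination, the sketch below samples action sequences, rolls each one out autoregressively through a model, and refits a Gaussian to the lowest-cost samples. Everything here is an assumption made for illustration: `toy_world_model` is a hypothetical additive stand-in for the learned predictor over $(s_k, t_k)$, and the squared-distance goal cost, horizon, population size, and elite count are not taken from the paper.

```python
import numpy as np


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned latent predictor (not VT-WM)."""
    return state + action


def rollout_cost(state, actions, goal):
    """Roll the model forward autoregressively and score the final state."""
    for action in actions:
        state = toy_world_model(state, action)
    return float(np.sum((state - goal) ** 2))


def cem_plan(state, goal, horizon=5, action_dim=2,
             pop=64, n_elite=8, iters=10, seed=0):
    """Cross-entropy method over action sequences of length `horizon`."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((pop, horizon, action_dim))
        samples = mean + std * noise
        # Evaluate every candidate purely in imagination.
        costs = np.array([rollout_cost(state, s, goal) for s in samples])
        # Refit the sampling distribution to the lowest-cost candidates.
        elite = samples[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


start = np.zeros(2)
goal = np.array([1.0, -0.5])
plan = cem_plan(start, goal)
final_cost = rollout_cost(start, plan, goal)
```

In a real setting the first action of `plan` would typically be executed and the plan recomputed from the new observation; whether VT-WM replans this way is not stated in this summary.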

Abstract

We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot-object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM demonstrates significant downstream versatility, effectively adapting its learned contact dynamics to a novel task and achieving reliable planning success with only a limited set of demonstrations.

Paper Structure (34 sections, 2 equations, 17 figures, 1 algorithm)

Figures (17)

  • Figure 1: Visuo-Tactile World Model complements vision with touch, providing contact grounding of robot-object interactions. When the WMs are used to plan a cube stacking task, the VT-WM maintains object permanence of the blue cube while transporting, placing, and releasing it. The contact grounding provided by the vision-based tactile sensor helps to reduce the hallucinations often present in V-WMs, enabling more reliable zero-shot planning in contact-rich manipulation tasks.
  • Figure 2: Tactile images from Digit 360 sensors. White boxes highlight contact while the hand holds a screw.
  • Figure 3: Visuo-Tactile World Model. Vision ($s_k$) and tactile ($t_k$) latents, obtained from Cosmos and Sparsh encoders, are processed by a transformer predictor given control actions $a_k$ to generate next-step states $(s_{k+1}, t_{k+1})$.
  • Figure 4: Object permanence. VT-WM reduces the normalized Fréchet distance for objects in motion by an average of $\approx33\%$ relative to V-WM (shown with 95% CI).
  • Figure 5: Comparison of rollouts, illustrating that VT-WM prevents spurious motion of objects not subject to forces, whereas V-WM often hallucinates unintended displacements.
  • ...and 12 more figures
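
Figure 4 reports a Fréchet distance for objects in motion. As a minimal sketch of the underlying metric, the code below computes the standard discrete Fréchet distance between two polylines with dynamic programming. It assumes the compared curves are 2D point sequences (for example, an imagined and a reference object trajectory); the function name `discrete_frechet`, the sample trajectories, and that interpretation are illustrative assumptions, and the normalization used in the paper is not reproduced here.

```python
import numpy as np


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    n, m = len(p), len(q)
    # Pairwise Euclidean distances between every point of p and q.
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    # ca[i, j] is the coupling distance for prefixes p[:i+1], q[:j+1].
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            best_prev = min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1])
            ca[i, j] = max(best_prev, d[i, j])
    return float(ca[-1, -1])


# Hypothetical trajectories: a straight path and a copy offset by one unit.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
same = discrete_frechet(reference, reference)
offset = discrete_frechet(reference, shifted)
```

A trajectory that teleports or vanishes partway through a rollout would produce a large distance under this metric, since the coupling must still pair every point of one curve with some point of the other.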