Visuo-Tactile World Models

Carolina Higuera; Sergio Arnaud; Byron Boots; Mustafa Mukadam; Francois Robert Hogan; Franziska Meier

TL;DR

The paper addresses the limitations of vision-only world models in contact-rich manipulation by introducing a visuo-tactile world model (VT-WM) that grounds contact through tactile sensing. It combines RGB latent representations from a Cosmos encoder with tactile embeddings from Sparsh-X on Digit 360 sensors, and uses a 12-layer transformer with spatio-temporal self-attention and cross-attention to actions to predict the next latents $(s_{k+1}, t_{k+1})$. Training mixes teacher-forcing and sampling losses over a context of $9$ frames. Planning is performed in imagination with the cross-entropy method (CEM), which enables zero-shot planning. Results show a $33\%$ improvement in object permanence, a $29\%$ improvement in causal compliance, and up to $35\%$ higher real-robot planning success in contact-rich tasks, plus $77\%$ success on a novel plate-in-rack task with only 20 demonstrations.
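To make "planning with CEM in imagination" concrete, here is a minimal sketch of the generic cross-entropy method over action sequences. It is not the paper's planner: in VT-WM the rollout would go through the transformer predictor on $(s_k, t_k)$ latents, whereas here `toy_dynamics` and `goal_cost` are hypothetical stand-ins, and every hyperparameter (`horizon`, `n_samples`, `n_elite`, `n_iters`) is an assumed value chosen only for illustration.

```python
import random
import statistics


def cem_plan(dynamics, cost, state, horizon=3, action_dim=2,
             n_samples=64, n_elite=8, n_iters=10, seed=0):
    """Generic CEM: sample action sequences, roll them out, refit to the elites."""
    rng = random.Random(seed)
    # Per-step, per-dimension Gaussian over actions (assumed initial values).
    mean = [[0.0] * action_dim for _ in range(horizon)]
    std = [[1.0] * action_dim for _ in range(horizon)]
    for _ in range(n_iters):
        scored = []
        for _ in range(n_samples):
            seq = [[rng.gauss(mean[k][d], std[k][d]) for d in range(action_dim)]
                   for k in range(horizon)]
            s = state
            for a in seq:  # "imagined" rollout through the model
                s = dynamics(s, a)
            scored.append((cost(s), seq))
        scored.sort(key=lambda item: item[0])
        elites = [seq for _, seq in scored[:n_elite]]
        # Refit the sampling distribution to the lowest-cost sequences.
        for k in range(horizon):
            for d in range(action_dim):
                vals = [e[k][d] for e in elites]
                mean[k][d] = statistics.fmean(vals)
                std[k][d] = statistics.pstdev(vals) + 1e-3
    return mean  # planned action sequence


def toy_dynamics(state, action):
    """Hypothetical stand-in for a learned world model: additive dynamics."""
    return [x + a for x, a in zip(state, action)]


def goal_cost(state, goal=(1.0, -0.5)):
    """Squared distance between the final imagined state and an assumed goal."""
    return sum((x - g) ** 2 for x, g in zip(state, goal))


plan = cem_plan(toy_dynamics, goal_cost, state=[0.0, 0.0])
final_state = [0.0, 0.0]
for action in plan:
    final_state = toy_dynamics(final_state, action)
```

In a receding-horizon setup one would typically execute only the first planned action and replan from the new observation; whether the paper does so is not stated in this summary.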

Abstract

We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot-object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM demonstrates significant downstream versatility, effectively adapting its learned contact dynamics to a novel task and achieving reliable planning success with only a limited set of demonstrations.

Paper Structure (34 sections, 2 equations, 17 figures, 1 algorithm)

Introduction
Related Works
Foundational encoders for vision and touch:
Action-Conditioned World Models for Real World Robotics:
World Models that Understand Contact
What vision doesn't see: Sensing contact with touch
MultiTask Visuo-Tactile World Model
Model Architecture
Spatio-Temporal Self-Attention
Action Conditioning via Cross-Attention
Training visuo-tactile world model
Planning in Imagination
Experiments
Contact Perception
Object Permanence:
...and 19 more sections

Figures (17)

Figure 1: The Visuo-Tactile World Model complements vision with touch, providing contact grounding of robot-object interactions. When the world models are used to plan a cube-stacking task, the VT-WM maintains object permanence of the blue cube while transporting, placing, and releasing it. The contact grounding provided by the vision-based tactile sensor helps reduce the hallucinations often present in V-WMs, enabling more reliable zero-shot planning in contact-rich manipulation tasks.
Figure 2: Tactile images from Digit 360 sensors. White boxes highlight contact while the hand holds a screw.
Figure 3: Visuo-Tactile World Model. Vision ($s_k$) and tactile ($t_k$) latents, obtained from Cosmos and Sparsh encoders, are processed by a transformer predictor given control actions $a_k$ to generate next-step states $(s_{k+1}, t_{k+1})$.
Figure 4: Object permanence. For objects in motion, VT-WM reduces the normalized Fréchet distance by $\approx33\%$ on average relative to V-WM (shown with 95% CI).
Figure 5: Comparison of rollouts, illustrating that VT-WM prevents spurious motion of objects not subject to forces, whereas V-WM often hallucinates unintended displacements.
...and 12 more figures
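
The Fréchet distance in the Figure 4 caption compares two trajectories while respecting their ordering. As a rough illustration only, the sketch below computes the standard discrete Fréchet distance between two 2D point sequences with dynamic programming. The function name `discrete_frechet` and the example trajectories are assumptions for this sketch; the paper's exact variant and its normalization are not given in this summary and are not reproduced here.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two sequences of (x, y) points."""
    n, m = len(p), len(q)
    # ca[i][j] = coupling distance between the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Hypothetical object trajectories: ground truth vs. an imagined rollout
# that is offset by one unit along y at every step.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
imagined = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
offset_distance = discrete_frechet(ground_truth, imagined)
```

A lower value means the imagined object path stays closer to the true path throughout the rollout, which is the sense in which the caption reports a reduction.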
