Visuo-Tactile World Models
Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, Franziska Meier
TL;DR
Vision-only world models struggle in contact-rich manipulation, where occlusion and ambiguous contact states hide what the robot is actually touching. This paper introduces a visuo-tactile world model (VT-WM) that grounds contact through tactile sensing. It combines RGB latent representations from a Cosmos encoder with tactile embeddings produced by Sparsh-X from Digit 360 sensors, and uses a 12-layer transformer with spatio-temporal self-attention and cross-attention to actions to predict the next latents $(s_{k+1}, t_{k+1})$. Training mixes teacher-forcing and sampling losses over a context of $9$ frames. Planning runs the cross-entropy method (CEM) in imagination, which enables zero-shot planning on the real robot. Reported results are $33\%$ better object permanence and $29\%$ better causal compliance (adherence to the laws of motion) in autoregressive rollouts, up to $35\%$ higher real-robot planning success in contact-rich tasks, and $77\%$ success on a novel plate-in-rack task with only 20 demonstrations.
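To make "CEM in imagination" concrete, here is a minimal sketch of the cross-entropy method over action sequences. It is an illustration under stated assumptions, not the authors' implementation. The names `cem_plan` and `make_toy_cost` are hypothetical, as are the sample counts, elite counts, and iteration counts. The toy additive dynamics stand in for the learned VT-WM rollout, and the squared-distance cost stands in for whatever goal cost the paper uses in latent space.

```python
import numpy as np


def cem_plan(rollout_cost, horizon, action_dim,
             n_samples=128, n_elites=16, n_iters=10, seed=0):
    """Cross-entropy method over action sequences (hypothetical helper).

    rollout_cost maps an array of shape (horizon, action_dim) to a scalar
    cost obtained by rolling the actions out in a world model.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        candidates = mean + std * noise
        # Score every candidate purely in imagination (no robot execution).
        costs = np.array([rollout_cost(c) for c in candidates])
        # Refit the Gaussian to the lowest-cost (elite) candidates.
        elites = candidates[np.argsort(costs)[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids collapse to zero
    return mean


def make_toy_cost(start, goal):
    """Toy stand-in for a world model: the state is a 2-D point and each
    action is added to it. ASSUMPTION: the real model predicts latents."""
    def rollout_cost(actions):
        state = start.copy()
        for action in actions:
            state = state + action  # stand-in for one world-model step
        return float(np.sum((state - goal) ** 2))
    return rollout_cost


goal = np.array([1.0, -0.5])
plan = cem_plan(make_toy_cost(np.zeros(2), goal), horizon=3, action_dim=2)
print("planned displacement:", plan.sum(axis=0))
```

In a receding-horizon setup, typically only the first action of `plan` is executed before replanning from the new observation. Whether VT-WM replans this way is not stated in the summary above.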
Abstract
We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot-object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM demonstrates significant downstream versatility, effectively adapting its learned contact dynamics to a novel task and achieving reliable planning success with only a limited set of demonstrations.
