Towards Interpretable Visuo-Tactile Predictive Models for Soft Robot Interactions
Enrico Donato, Thomas George Thuruthel, Egidio Falotico
TL;DR
This work tackles the challenge of interpretable visuo-tactile prediction for soft robot interactions by building a generative, multi-modal world model: a Conditional Variational Autoencoder with modality-specific encoders and decoders. The model supports late fusion of proprioceptive, tactile, and visual information and action-conditioned prediction of the next state $\hat{s}_{t+1}$, and it comes with tools to interpret the latent representations through latent-space visualization and analysis of generative properties. Empirical results on simulated soft-finger interactions show that including vision supports proprioception and enhances force prediction, with conditioned latent spaces revealing structured organization and cross-modal dependencies. The study demonstrates practical interpretability gains and outlines steps toward deploying these world models for control in soft robotics, with future extensions to actuation and feedback-aware policies.
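To make the data flow described above concrete, the following is a minimal, untrained sketch of a late-fusion, action-conditioned forward pass in plain Python. It is not the authors' implementation: the class name `LateFusionCVAE`, the layer sizes, the single-layer encoders and decoders, and the choice to condition both the latent heads and the decoders on the action are all assumptions made for illustration.

```python
import math
import random


def linear(x, w, b):
    """Affine map: one output value per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init_linear(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for illustration only)."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


class LateFusionCVAE:
    """Hypothetical toy model: per-modality encoders, fused latent, per-modality decoders."""

    def __init__(self, dims, action_dim, feat_dim=4, latent_dim=2, seed=0):
        self.rng = random.Random(seed)
        self.dims = dims
        # One encoder per modality, so each raw input is processed separately.
        self.enc = {m: init_linear(self.rng, d, feat_dim) for m, d in dims.items()}
        fused_dim = feat_dim * len(dims) + action_dim
        self.mu_head = init_linear(self.rng, fused_dim, latent_dim)
        self.logvar_head = init_linear(self.rng, fused_dim, latent_dim)
        # One decoder per modality, each reading the latent sample plus the action.
        self.dec = {m: init_linear(self.rng, latent_dim + action_dim, d)
                    for m, d in dims.items()}

    def encode(self, state, action):
        feats = []
        for m in self.dims:
            feats += [math.tanh(v) for v in linear(state[m], *self.enc[m])]
        # Late fusion: modality features are concatenated only after encoding.
        fused = feats + list(action)
        return linear(fused, *self.mu_head), linear(fused, *self.logvar_head)

    def sample(self, mu, logvar):
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def predict_next(self, state, action):
        mu, logvar = self.encode(state, action)
        z_cond = self.sample(mu, logvar) + list(action)
        next_state = {m: linear(z_cond, *self.dec[m]) for m in self.dims}
        return next_state, mu, logvar


def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, diag(exp(logvar))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


# Illustrative dimensions only; the real sensor sizes are not given here.
dims = {"proprioception": 3, "tactile": 2, "vision": 6}
model = LateFusionCVAE(dims, action_dim=1)
state = {m: [0.1] * d for m, d in dims.items()}
next_state, mu, logvar = model.predict_next(state, action=[0.5])
kl = kl_to_standard_normal(mu, logvar)
```

Keeping the encoders separate until the fusion step is what allows each modality's contribution to the latent space to be inspected on its own, for example by zeroing one modality's input and comparing the resulting `mu`.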
Abstract
Autonomous systems face the intricate challenge of navigating unpredictable environments and interacting with external objects. The successful integration of robotic agents into real-world situations hinges on their perception capabilities, which combine world models with predictive skills. Effective perception models fuse multiple sensory modalities to probe the surroundings. Deep learning applied to raw sensory data offers a viable way to build such models, but the learned perceptual representations are difficult to interpret. This challenge is particularly pronounced in soft robots, where the compliance of structures and materials makes prediction even harder. Our work addresses this complexity by using a generative model to construct a multi-modal perception model for soft robots that leverages proprioceptive and visual information to anticipate and interpret contact interactions with external objects. We provide a suite of tools to interpret the perception model, shedding light on how multiple sensory inputs are fused and used for prediction after the learning phase. We also discuss the outlook for the perception model and its implications for control.
