Towards Interpretable Visuo-Tactile Predictive Models for Soft Robot Interactions

Enrico Donato; Thomas George Thuruthel; Egidio Falotico

TL;DR

This work tackles the challenge of interpretable visuo-tactile prediction for soft robot interactions by building a generative, multi-modal world model using modality-specific encoders/decoders within a Conditional Variational Autoencoder. It enables late fusion of proprioceptive, tactile, and visual information and action-conditioned prediction of the next state $\hat{s}_{t+1}$, while providing tools to interpret latent representations through latent-space visualization and analysis of generative properties. Empirical results on simulated soft-finger interactions show that including vision supports proprioception and enhances force prediction, with conditioned latent spaces revealing structured organization and cross-modal dependencies. The study demonstrates practical interpretability gains and outlines steps toward deploying these world models for control in soft robotics, with future extensions to actuation and feedback-aware policies.
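The data flow described above (modality-specific encoders, late fusion, action conditioning, modality-specific decoders) can be sketched roughly as follows. This is a minimal, untrained illustration, not the authors' implementation: the class name `ToyCVAE`, the helpers `make_dense` and `dense`, the single-layer encoders and decoders, and all dimensions are assumptions made for this example. The training objective is omitted.

```python
import math
import random


def make_dense(in_dim, out_dim, rng):
    """Create an untrained dense layer as a (weights, bias) pair."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def dense(layer, x, tanh=False):
    """Apply a dense layer to vector x, optionally followed by tanh."""
    weights, bias = layer
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [math.tanh(v) for v in y] if tanh else y


class ToyCVAE:
    """Hypothetical, untrained stand-in for the perception model."""

    def __init__(self, dims, action_dim, feat_dim=8, latent_dim=2, seed=0):
        self.rng = random.Random(seed)
        # One encoder and one decoder per modality.
        self.encoders = {m: make_dense(d, feat_dim, self.rng) for m, d in dims.items()}
        fused_dim = feat_dim * len(dims)
        self.to_mu = make_dense(fused_dim, latent_dim, self.rng)
        self.to_logvar = make_dense(fused_dim, latent_dim, self.rng)
        self.condition = make_dense(latent_dim + action_dim, latent_dim, self.rng)
        self.decoders = {m: make_dense(latent_dim, d, self.rng) for m, d in dims.items()}

    def predict(self, obs, action, sample=True):
        # Late fusion: encode each modality on its own, then concatenate.
        fused = []
        for name, layer in self.encoders.items():
            fused += dense(layer, obs[name], tanh=True)
        mu = dense(self.to_mu, fused)
        logvar = dense(self.to_logvar, fused)
        if sample:
            # Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, 1).
            z = [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                 for m, lv in zip(mu, logvar)]
        else:
            z = mu  # deterministic: use the posterior mean
        # Action conditioning: map (z, a_t) to the conditioned latent space.
        z_cond = dense(self.condition, z + list(action), tanh=True)
        # Decoding: each modality gets its own decoder.
        prediction = {name: dense(layer, z_cond) for name, layer in self.decoders.items()}
        return prediction, z, z_cond


# Illustrative dimensions only; they are not taken from the paper.
dims = {"proprioception": 6, "force": 3, "optical_flow": 16}
model = ToyCVAE(dims, action_dim=2)
obs = {name: [0.1] * d for name, d in dims.items()}
prediction, z, z_cond = model.predict(obs, action=[0.5, -0.5], sample=False)
```

Here `z` plays the role of the encoded latent space and `z_cond` the conditioned latent space, the two representations the paper visualizes. With random weights the predictions are meaningless; the sketch only shows where fusion and conditioning happen.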

Abstract

Autonomous systems must navigate unpredictable environments and interact with external objects. Integrating robotic agents into real-world settings hinges on their perception capabilities, which combine world models with predictive skills. Effective perception models fuse multiple sensory modalities to probe the surroundings, and deep learning applied to raw sensory data is a viable way to build them. However, learned perceptual representations are difficult to interpret. The challenge is particularly pronounced in soft robots, where the compliance of structures and materials makes prediction even harder. Our work addresses this complexity by using a generative model to construct a multi-modal perception model for soft robots that leverages proprioceptive and visual information to anticipate and interpret contact interactions with external objects. We provide a suite of tools for interpreting the perception model, shedding light on how multiple sensory inputs are fused and used for prediction after learning. Finally, we discuss the outlook for the perception model and its implications for control.

Paper Structure

This paper contains 15 sections, 6 figures, 1 table.

Introduction
Related works
  Multi-modal sensory fusion in robotics
  Interpretable multi-modal representation learning
World Model Generation
  Simulation scenario
  Perception model
Interpret the Sensory Representation
  Latent space visualization
  Generative model properties
Results
  Perception model performance assessment
  Latent space organization and visualization
  Analysis of generative properties
Conclusion

Figures (6)

Figure 1: The soft finger interacts with the environment and multi-modal information is gathered and mapped in a shared latent representation. After being conditioned by the robot's future action, the latent representation is mapped back into the multi-modal sensory domain to get the predicted state.
Figure 2: The simulated robotic platform, taken from donato2024perceptiongenerative, consists of a passive finger attached to the distal end of a rigid cylindrical robot. The finger interacts with the ground or with movable objects; it has 20 DoFs and performs flexion/extension and adduction/abduction movements.
Figure 3: The perception model is implemented through a Conditional Variational AutoEncoder, with late fusion and early decoding stages thanks to modality-specific encoders and decoders. Fusion aims to map single-modality representations into the encoded latent space. Action conditioning enables the mapping of the information to the conditioned latent space, later used for sensory prediction.
Figure 4: Prediction of the perception model over different output modalities. (A) Proprioception and force prediction over different latent dimensions and input modalities. (B) Optical flow prediction over different latent dimensions.
Figure 5: Latent space visualization and analysis. (A) Encoded latent space and (B) conditioned latent space over different modalities and latent dimensions. (C) Relation between distance from cluster centroid and force.
...and 1 more figure
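
The kind of analysis named in the Figure 5(C) caption, relating a sample's distance from its cluster centroid to the contact force, could be computed along these lines. This is a sketch under stated assumptions: the latent codes and forces below are synthetic, and the forces are constructed to grow with distance, so the resulting correlation is not a result from the paper. The helper names `centroid` and `pearson` are likewise made up for this example.

```python
import math


def centroid(points):
    """Mean of a list of equal-length latent vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic 2-D latent codes for one cluster (made up for illustration).
latents = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, -2.0), (0.0, 2.0)]
center = centroid(latents)
distances = [math.dist(p, center) for p in latents]
# Synthetic forces, constructed to grow linearly with distance.
forces = [0.5 * d for d in distances]
r = pearson(distances, forces)
```

On real data, `latents` would be the encoded or conditioned latent codes of one cluster and `forces` the measured contact forces for the same samples; the sign and strength of `r` would then have to be read from the data.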
