Multi-modal perception for soft robotic interactions using generative models

Enrico Donato, Egidio Falotico, Thomas George Thuruthel

TL;DR

The paper builds a compact, multi-modal state representation for soft robots using an action-conditioned generative model that fuses proprioception, vision, and touch. A Conditional Variational Autoencoder encodes the fused inputs into a latent state and predicts the next observation $\hat{s}_{t+1}$ conditioned on the action $a_t$; an auxiliary reconstruction path assesses how much information the latent state preserves. Experiments in a SoMo-based soft-finger simulation compare single- and multi-modality predictions, examine how latent dimension and fusion affect predictive accuracy and reconstruction quality, and indicate that vision improves touch prediction and force estimation. The framework is intended as a step toward perceptually aware soft-robot control and toward RL policies that operate on compact, task-agnostic state representations in unstructured environments.

Abstract

Perception is essential for the active interaction of physical agents with the external environment. The integration of multiple sensory modalities, such as touch and vision, enhances this perceptual process, creating a more comprehensive and robust understanding of the world. Such fusion is particularly useful for highly deformable bodies such as soft robots. Developing a compact, yet comprehensive state representation from multi-sensory inputs can pave the way for the development of complex control strategies. This paper introduces a perception model that harmonizes data from diverse modalities to build a holistic state representation and assimilate essential information. The model relies on the causality between sensory input and robotic actions, employing a generative model to efficiently compress fused information and predict the next observation. We present, for the first time, a study on how touch can be predicted from vision and proprioception on soft robots, the importance of the cross-modal generation and why this is essential for soft robotic interactions in unstructured environments.
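
To make the data flow concrete, here is a minimal NumPy sketch of one forward pass through an action-conditioned CVAE of the kind described above: the three modalities are fused, compressed into a latent state, and decoded together with the action into a prediction of the next observation. It is an illustration under stated assumptions, not the authors' implementation. The layer sizes, the concatenation-based fusion, the single hidden layer, the `beta` weight, and all function names are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
DIM_PROPRIO, DIM_VISION, DIM_TOUCH = 20, 64, 8
DIM_ACTION, DIM_LATENT, DIM_HIDDEN = 2, 16, 32
DIM_OBS = DIM_PROPRIO + DIM_VISION + DIM_TOUCH


def dense(n_in, n_out):
    """Random weights and zero bias for one linear layer (untrained)."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def linear(x, layer):
    w, b = layer
    return x @ w + b


params = {
    "enc_hidden": dense(DIM_OBS, DIM_HIDDEN),
    "enc_mu": dense(DIM_HIDDEN, DIM_LATENT),
    "enc_logvar": dense(DIM_HIDDEN, DIM_LATENT),
    "dec_hidden": dense(DIM_LATENT + DIM_ACTION, DIM_HIDDEN),
    "dec_out": dense(DIM_HIDDEN, DIM_OBS),
}


def forward(params, proprio, vision, touch, action):
    # Assumed fusion scheme: simple concatenation of the modality vectors.
    s_t = np.concatenate([proprio, vision, touch], axis=-1)

    # Encoder: fused observation -> parameters of a diagonal Gaussian.
    h = np.tanh(linear(s_t, params["enc_hidden"]))
    mu = linear(h, params["enc_mu"])
    logvar = linear(h, params["enc_logvar"])

    # Reparameterisation: z = mu + sigma * eps, with eps ~ N(0, I).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

    # Decoder: latent state plus action -> predicted next observation.
    h_dec = np.tanh(linear(np.concatenate([z, action], axis=-1), params["dec_hidden"]))
    s_next_hat = linear(h_dec, params["dec_out"])
    return s_next_hat, mu, logvar


def cvae_loss(s_next_hat, s_next, mu, logvar, beta=1.0):
    # Squared prediction error on the next observation, averaged over the batch.
    pred = np.mean(np.sum((s_next_hat - s_next) ** 2, axis=-1))
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1))
    return pred + beta * kl, pred, kl


# Usage with random stand-in data (a batch of 4 transitions).
batch = 4
proprio = rng.normal(size=(batch, DIM_PROPRIO))
vision = rng.normal(size=(batch, DIM_VISION))
touch = rng.normal(size=(batch, DIM_TOUCH))
action = rng.normal(size=(batch, DIM_ACTION))
s_next = rng.normal(size=(batch, DIM_OBS))

s_next_hat, mu, logvar = forward(params, proprio, vision, touch, action)
loss, pred, kl = cvae_loss(s_next_hat, s_next, mu, logvar)
```

In this sketch the action enters only at the decoder, so the latent vector `z` depends on the fused observation alone and can serve as an action-independent state. That placement is a design assumption made here for illustration. A trained version would replace the random weights with parameters optimised by gradient descent on `loss`.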

Paper Structure (12 sections, 5 equations, 9 figures)


Figures (9)

  • Figure 1: The simulated environment, in which a soft robot interacting with unstructured surroundings attempts to combine visual and proprioceptive feedback for compact state representation and tactile prediction.
  • Figure 2: Fusion and prediction learning architecture. (a) Prediction of the next sensory observation starting from a current observation and performed action, after undergoing a fusion and compression stage. (b) Implementation of the model on a Conditional Variational Auto-Encoder.
  • Figure 3: Information reconstruction learning architecture.
  • Figure 4: Robotic platform simulation setup. The passive finger is mounted at the distal end of a cylindrical rigid robot and interacts with the ground or, possibly, with movable objects. The finger has 20 DoFs and performs flexion/extension and adduction/abduction movements.
  • Figure 5: Contact forces in simulation.
  • ...and 4 more figures