Cross-Modal Visuo-Tactile Object Perception

Anirvan Dutta, Simone Tasciotti, Claudia Cusseddu, Ang Li, Panayiota Poirazi, Julijana Gjorgjieva, Etienne Burdet, Patrick van der Smagt, Mohsen Kaboli

Abstract

Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forced sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust, and physically consistent cross-modal integration for robotic multi-sensory perception.
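To ground the Bayesian integration invoked above, the following minimal sketch (plain Python with NumPy; hypothetical scalar inputs rather than the paper's learned latents) shows precision-weighted fusion of a visual and a tactile Gaussian estimate of the same latent property, the standard cue-combination rule from the human multisensory literature:

    import numpy as np

    def fuse_gaussian(mu_v, var_v, mu_t, var_t):
        # Precision-weighted (inverse-variance) fusion of two Gaussian
        # estimates of the same latent property, e.g. vision and touch.
        w_v = (1.0 / var_v) / (1.0 / var_v + 1.0 / var_t)  # reliability of vision
        mu = w_v * mu_v + (1.0 - w_v) * mu_t               # fused mean
        var = 1.0 / (1.0 / var_v + 1.0 / var_t)            # fused variance
        return mu, var

    # Example: a confident visual cue dominates a noisy tactile one.
    mu, var = fuse_gaussian(mu_v=0.8, var_v=0.01, mu_t=0.2, var_t=0.09)
    print(mu, var)  # ~0.74, ~0.009: pulled toward vision, sharper than either cue

The fused variance is always below either unimodal variance, and the fused mean is pulled toward the more reliable cue; when the cues conflict, the same rule yields a confidently biased estimate, which is the textbook account of the cross-modal illusions mentioned above.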

Paper Structure

This paper contains 21 sections, 19 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Concept of the proposed Cross-Modal Latent Filter (CMLF) perception framework. Inspired by human multi-sensory processing, the model uses extrinsic visual cues to form priors over intrinsic object properties, and vice versa, in a fully unsupervised manner from raw visual and tactile data. Bayesian integration (BI) underpins this cross-modal inference, improving the robustness and efficiency of object property estimation by exploiting statistical regularities across modalities.
  • Figure 2: Robotic experimental setup and data collection pipeline. a) The figure illustrates the main stages used to construct the visuo-tactile interaction dataset. The process begins with initial estimation of the object shape and pose, which is used to autonomously perform prehensile robotic interactions. These interactions generate rich, time-varying visual and tactile observations capturing both geometric and contact dynamics. The resulting multi-sensory streams are then processed by the proposed Cross-Modal Latent Filter (CMLF) to infer latent object properties and interaction dynamics over time. b) To emulate everyday non-rigid objects, we designed a set of synthetic objects with configurable material properties, enabling controlled variation in flexibility. This allows us to systematically control intrinsic (e.g. mass, stiffness, surface friction) and extrinsic (e.g. size, shape, visual texture) physical attributes and to study their cross-modal associations.
  • Figure 3: Cross-modal latent filter architecture with cross-modal connections from vision to tactile (CM-V2T) and tactile to vision (CM-T2V) modalities. The initial priors are set to $q^{filt}(\mathbf{z}_0^{V/T})\sim \mathcal{N}(0,1)$. (A scalar sketch of this filtering loop with cross-modal priors follows the figure list.)
  • Figure 4: Classification and regression to evaluate inference performance. Statistical significance is assessed using paired t-tests with Holm--Bonferroni correction for multiple comparisons, with *** denoting $p < 0.001$, ** $p < 0.01$, and * $p < 0.05$; non-significant comparisons are omitted. (a) Classification accuracy for each latent feature; higher values indicate better performance, and error bars denote $\pm\,1$ standard deviation. (b) Regression performance measured by trajectory NMSE for each physical property, with error bars denoting $\pm\,1$ standard deviation; lower values indicate better estimation. (c) Temporal evolution of NMSE for intrinsic and extrinsic properties, with shaded regions indicating $\pm 0.1$ standard deviation across interactions; lower values indicate better estimation. The results show that cross-modal priors derived from extrinsic properties significantly improve the efficiency of intrinsic object property inference compared to baseline approaches.
  • Figure 5: CMLF perceptual similarity to human inference. a) Mean trajectory error on the surprise set. Statistical significance is evaluated using paired t-tests with Holm--Bonferroni correction. b) Effect of delayed activation of cross-modal priors on inference of intrinsic properties. c) Temporal evolution of NMSE on the surprise set, with shaded regions indicating $\pm 0.1$ standard deviation. d) Representative examples illustrating how latent filtering and cross-modal priors differ between the aligned and surprise sets. In the aligned set, the visual-to-tactile prior (CM-V2T) provides informative cues derived from extrinsic properties, enabling faster convergence of latent estimates. With the surprise set, the prior trained on the aligned set's cross-modal associations becomes misleading; however, Bayesian integration gradually corrects the estimate toward the ground truth.
  • ...and 6 more figures
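As referenced in the Figure 3 caption, below is a minimal sketch of a filtering loop with a cross-modal prior. Everything here is a hypothetical scalar stand-in (filter_step, cm_prior, the random-walk transition); CMLF itself learns its transition and cross-modal models from raw data, so this only illustrates the predict / fuse-prior / correct mechanics:

    import numpy as np

    rng = np.random.default_rng(0)

    def filter_step(mu, var, y, obs_var, prior_mu, prior_var, proc_var=0.01):
        # Hypothetical scalar stand-in for one CMLF update (not the paper's model).
        # (1) Predict: a random-walk transition inflates uncertainty.
        var = var + proc_var
        # (2) Fuse the propagated belief with the cross-modal (CM-V2T) prior
        #     via a product of Gaussians, written in gain form.
        k = var / (var + prior_var)
        mu, var = mu + k * (prior_mu - mu), (1.0 - k) * var
        # (3) Correct with the noisy in-modality (tactile) observation y.
        g = var / (var + obs_var)
        return mu + g * (y - mu), (1.0 - g) * var

    # Toy run: true stiffness 0.7, noisy tactile readings, and an informative
    # visual prior (as in the aligned set of Figure 5d).
    true_z, obs_var = 0.7, 0.05
    mu, var = 0.0, 1.0          # initial prior q(z_0) ~ N(0, 1), as in Figure 3
    cm_prior = (0.65, 0.02)     # hypothetical CM-V2T output: (mean, variance)
    for t in range(20):
        y = true_z + rng.normal(0.0, obs_var ** 0.5)
        mu, var = filter_step(mu, var, y, obs_var, *cm_prior)
    print(f"estimate {mu:.3f} +/- {var ** 0.5:.3f}  (true {true_z})")

Setting prior_var very high effectively disables the CM-V2T pathway, leaving a slower unimodal filter that must converge from the uninformative $\mathcal{N}(0,1)$ start, which qualitatively matches the delayed-prior effect reported in Figure 5b.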