Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation

Wonju Lee, Matteo Grimaldi, Tao Yu

TL;DR

This paper proposes a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. It also introduces a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control.

Abstract

Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.

Paper Structure

This paper contains 24 sections, 7 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison of observation modalities for robotic insertion policies. Left: Vision-only input provides global alignment cues but lacks local precision. Center: Tactile-only input encodes fine-grained force signals critical for corrective actions. Right: Visuo-tactile fusion integrates coarse visual guidance with detailed tactile feedback, achieving robust insertion by exploiting complementary strengths.
  • Figure 2: Overview of visuo-tactile fusion architectures. (a) Naïve concatenation of embeddings, which risks diluting modality-specific signals. (b) Gated fusion with linear layers that adaptively weight neuronal contributions. (c) The proposed Cross-Modal Transformer (CMT), which embeds symmetry-aware tactile encoding and integrates vision and touch via cross-attention.
  • Figure 3: Physics-informed symmetry regularization. The right tactile map is vertically flipped and encoded as $\tilde{h}_t^R$, then compared with $h_t^L$. The mean squared error loss penalizes deviations, encouraging bilateral consistency. This auxiliary objective stabilizes grasp forces before insertion and reduces lateral misalignment during insertion.
  • Figure 4: Evolution of bilateral force fields during insertion under two fusion strategies. Top: Naïve fusion does not enforce symmetry; contact with the socket induces pronounced left–right imbalance, triggering unstable corrections and occasional re-grasping. Bottom: The proposed symmetry-aware CMT maintains balanced force distributions throughout the episode, reducing unnecessary lateral contact and yielding a straighter, smoother insertion trajectory aligned with the table normal. This illustrates how explicitly modeling bilateral symmetry stabilizes contact-rich manipulation under visuo-tactile fusion.
  • Figure 5: Distributions of insertion performance for Naïve (blue), Gated (orange), and CMT (green). Scatter points denote individual trials, with kernel density contours indicating outcome distributions in terms of success rate (x-axis) and steps to succeed (y-axis).
  • ...and 1 more figures
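
The symmetry regularization described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the loss is the mean squared error between the left embedding $h_t^L$ and the embedding $\tilde{h}_t^R$ of the flipped right tactile map, as the caption states. Everything else is a hypothetical stand-in and not the authors' implementation: the function names, the interpretation of "vertically flipped" as reversing the row order, and the placeholder encoder that merely flattens the map (a learned tactile encoder would take its place).

```python
from typing import Callable, List, Sequence

Grid = Sequence[Sequence[float]]


def flip_vertical(tactile_map: Grid) -> List[List[float]]:
    """Reverse the row order of a 2-D tactile map.

    Assumption: "vertically flipped" means an up-down flip; the caption
    does not state the axis convention.
    """
    return [list(row) for row in reversed(tactile_map)]


def flatten_encoder(tactile_map: Grid) -> List[float]:
    """Hypothetical placeholder encoder: flattens the map row by row."""
    return [float(value) for row in tactile_map for value in row]


def symmetry_loss(
    left_map: Grid,
    right_map: Grid,
    encode: Callable[[Grid], List[float]] = flatten_encoder,
) -> float:
    """Mean squared error between h^L and the embedding of the flipped right map."""
    h_left = encode(left_map)
    h_right_flipped = encode(flip_vertical(right_map))
    if len(h_left) != len(h_right_flipped):
        raise ValueError("left and right embeddings must have the same length")
    squared_errors = [(a - b) ** 2 for a, b in zip(h_left, h_right_flipped)]
    return sum(squared_errors) / len(squared_errors)


left = [[1.0, 2.0], [3.0, 4.0]]
mirrored_right = [[3.0, 4.0], [1.0, 2.0]]   # exact mirror image of `left`
unbalanced_right = [[1.0, 2.0], [3.0, 4.0]]  # not a mirror image of `left`

print(symmetry_loss(left, mirrored_right))    # 0.0
print(symmetry_loss(left, unbalanced_right))  # 4.0
```

In this toy setting the loss is zero when the two tactile maps are mirror images of each other and grows with left-right imbalance. That is the behavior the caption attributes to the auxiliary objective. How this term is weighted against the policy objective is not specified in the text above.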