ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert

TL;DR

ViTaPEs introduces a transformer framework with multi-scale visuotactile positional encodings that inject modality-specific and global spatial structure into fused visual-tactile representations. The encodings are designed to be injective, rigid-motion-equivariant, and information-preserving, with formal guarantees supported by empirical validation. Across material-property recognition, object detection, zero-shot generalization, and robotic grasping, ViTaPEs achieves state-of-the-art performance and strong cross-sensor transfer, including robust zero-shot generalization to unseen sensors. The work demonstrates that carefully designed multi-scale positional encodings enable robust cross-modal alignment, reducing reliance on large vision-language pretraining and enabling practical visuotactile perception in real-world robotics. The approach scales with transformer capacity, maintains efficiency, and highlights clear avenues for extending visuotactile fusion to larger models and more complex manipulation tasks, guided by theoretical guarantees.

Abstract

Tactile sensing provides essential local information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile input data to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures, while simultaneously modeling cross-modal cues. Unlike prior work, we provide provable guarantees in visuotactile fusion, showing that our encodings are injective, rigid-motion-equivariant, and information-preserving, and we validate these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes

Paper Structure

This paper contains 60 sections, 3 theorems, 29 equations, 6 figures, 8 tables.

Key Result

Theorem 3.1

Under the implementation assumptions, the map $\Phi: (V,T)\mapsto X_{\mathrm{projected}}(V,T) + \mathbf{PE}^{\text{global}}$ is injective on $\mathbb R^{N_{\text{visual}}\times D}\times\mathbb R^{N_{\text{tactile}}\times D}$.
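
One step of the reasoning behind this statement can be written out explicitly. The sketch below assumes that $\mathbf{PE}^{\text{global}}$ is a fixed parameter that does not depend on the inputs $(V,T)$; the primed symbols $V'$ and $T'$ are introduced here only for illustration and denote a second input pair.

```latex
\begin{align*}
\Phi(V,T) = \Phi(V',T')
  &\iff X_{\mathrm{projected}}(V,T) + \mathbf{PE}^{\text{global}}
      = X_{\mathrm{projected}}(V',T') + \mathbf{PE}^{\text{global}} \\
  &\iff X_{\mathrm{projected}}(V,T) = X_{\mathrm{projected}}(V',T').
\end{align*}
```

Under that assumption, adding the global encoding is an invertible shift, so $\Phi$ is injective exactly when $X_{\mathrm{projected}}$ is. The implementation assumptions, which are not reproduced on this page, would then have to secure the injectivity of the projection itself; the full-rank condition on $W_g$ mentioned in the caption of Figure 5 appears to play that role.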

Figures (6)

  • Figure 1: Task-accuracy radar comparing visuotactile models. ViTaPEs outperforms all others in robustness and cross-domain generalization.
  • Figure 2: ViTaPEs framework: The visual and tactile inputs are projected into separate token spaces, followed by the addition of modality-specific (green and orange) and shared (purple) global PEs for multi-modal fusion.
  • Figure 3: Learned PEs in ViTaPEs after training: visual, tactile, and global (left to right). Each PE exhibits a unique spatial structure reflecting modality-specific priors and representational needs.
  • Figure 4: Cosine-similarity heatmap of all positional-encoding rows at epoch 50. The bright diagonal indicates self-similarity of 1, while off-diagonal values hover around 0.35–0.40, confirming no row collisions (Assumption A1).
  • Figure 5: Log-scale singular-value spectrum of $W_g$ over 80 epochs. Singular values decrease smoothly from $\approx 1.0$ to $\approx 0.15$, and the minimum remains well above zero, validating that $W_g$ has full rank (Assumption A2).
  • ...and 1 more figure
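
The two diagnostics described in the captions of Figures 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the matrices `pe` and `w_g` are random stand-ins for the learned positional encodings and for $W_g$, the sizes `num_positions` and `dim` are hypothetical, and a Gaussian-elimination rank check is used in place of the singular-value spectrum shown in Figure 5 (a matrix has full rank exactly when its smallest singular value is nonzero).

```python
import math
import random


def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the rows of a matrix (list of lists)."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (ni * nj) for rj, nj in zip(rows, norms)]
        for ri, ni in zip(rows, norms)
    ]


def max_off_diagonal(sim):
    """Largest similarity between two different rows; a value of 1.0 means a collision."""
    n = len(sim)
    return max(sim[i][j] for i in range(n) for j in range(n) if i != j)


def matrix_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (absolute tolerance)."""
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Hypothetical stand-ins: random matrices instead of trained parameters.
rng = random.Random(0)
num_positions, dim = 16, 8
pe = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_positions)]
w_g = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

sim = cosine_similarity_matrix(pe)
no_collisions = max_off_diagonal(sim) < 1.0 - 1e-6  # A1-style check
full_rank = matrix_rank(w_g) == dim                 # A2-style check
print("no row collisions:", no_collisions)
print("W_g full rank:", full_rank)
```

With trained parameters, `pe` and `w_g` would be replaced by the learned matrices exported from the model. The collision check only rules out rows that point in the same direction, so rows that differ only in scale would need a separate comparison.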

Theorems & Definitions (3)

  • Theorem 3.1: Injectivity
  • Theorem 3.2: Cross-Modal Translation Equivariance
  • Proposition 3.3: Entropy Preservation