TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception

Kailin Lyu, Long Xiao, Jianing Zeng, Junhao Dong, Xuexin Liu, Zhuojun Zou, Haoyue Yang, Lin Shu, Jie Hao

TL;DR

TouchFormer addresses the brittleness of vision-free material perception with three components: Modality-Adaptive Gating (MAG) to filter and weight modalities, intra- and inter-modal Transformer attention to fuse asynchronous multimodal signals, and Cross-Instance Embedding Regularization (CER) to sharpen the embedding space. The components are optimized jointly with a combined loss $\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{C}$, giving consistent improvements over non-visual baselines while remaining robust to modality noise. The approach is validated on the LMTHM and Fire Incident Sound and Haptic Material (FISHM) datasets, and real-world robotic experiments demonstrate its applicability in fire-like, vision-constrained environments. Compared with existing non-visual methods, TouchFormer improves classification accuracy by $2.48\%$ on SSMC and by $6.83\%$ on USMC, underscoring its practical potential for safety-critical robotics and autonomous systems.
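
As a rough illustration of how a combined loss of the form $\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{C}$ can be assembled, the NumPy sketch below adds a weighted regularization term to a cross-entropy classification loss. This summary does not define $\mathcal{L}_{C}$, so `embedding_regularizer` is a hypothetical stand-in (a supervised-contrastive-style term). The function names, `temperature`, and the default `lam` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, standing in for L_cls."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def embedding_regularizer(embeddings, labels, temperature=0.1):
    """ASSUMPTION: contrastive-style placeholder for L_C (not from the paper).

    Pulls same-label embeddings together relative to all other instances.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    same = (labels[:, None] == labels[None, :]) & not_self
    # Log-softmax over every other instance in the batch (self excluded).
    sim = np.where(not_self, z @ z.T / temperature, -np.inf)
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    pos_count = same.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one same-label partner
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(same, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())

def total_loss(logits, embeddings, labels, lam=0.5):
    """L_total = L_cls + lam * L_C, with lam a hypothetical default weight."""
    return cross_entropy(logits, labels) + lam * embedding_regularizer(embeddings, labels)
```

Setting `lam=0` recovers plain classification training, which makes the weight a convenient knob for checking how much the regularizer contributes.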

Abstract

Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual multimodal material perception. Despite this, existing approaches frequently perform naive fusion of multimodal inputs, overlooking key challenges such as modality-specific noise, missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose a robust multimodal fusion framework, TouchFormer. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization (CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, the proposed TouchFormer framework achieves classification accuracy improvements of 2.48% and 6.83% on the SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer's effectiveness in enabling robots to better perceive and interpret their environment, paving the way for its deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-sourced, and the videos are available in the supplementary materials.
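
To make the idea of weighting modalities by quality and tolerating missing ones more concrete, here is a minimal NumPy sketch of a generic per-modality gate. The abstract does not specify how MAG is computed, so everything here is an assumption: the scalar sigmoid gate over mean-pooled features, the function `gate_modalities`, the parameter dictionaries, and the modality names used in the usage example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_modalities(features, gate_weights, gate_biases):
    """Scale each modality's sequence by a scalar gate in (0, 1).

    features: dict mapping modality name -> array of shape (T_m, d),
              or None when the modality is missing. Sequence lengths T_m
              may differ, so no temporal alignment is required.
    gate_weights / gate_biases: hypothetical learned parameters per modality.
    Returns (gated, gates); missing modalities get gate 0.0 and are dropped.
    """
    gated, gates = {}, {}
    for name, x in features.items():
        if x is None:
            gates[name] = 0.0  # missing modality contributes nothing
            continue
        summary = x.mean(axis=0)  # mean-pool over time -> (d,)
        g = float(sigmoid(summary @ gate_weights[name] + gate_biases[name]))
        gates[name] = g
        gated[name] = g * x  # down-weight the whole sequence
    return gated, gates

# Usage with hypothetical modality names and untrained (zero) parameters.
features = {
    "sound": np.ones((5, 4)),
    "acceleration": 2.0 * np.ones((3, 4)),
    "friction": None,  # simulated missing modality
}
weights = {name: np.zeros(4) for name in features}
biases = {name: 0.0 for name in features}
gated, gates = gate_modalities(features, weights, biases)
```

In a trained model the gate parameters would be learned so that noisy inputs receive gates near zero; the gated sequences would then be passed to the attention-based fusion stage.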

Paper Structure

This paper contains 20 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) In the SSMC task, when facing real-world modality noise, TouchFormer shows only minor performance loss, in contrast to the substantial degradation of baselines such as MTCNN [wangsuo] and DDQN [liu2023surface]. (b) In the simulated fire scenario prototype, TouchFormer collects and processes multimodal data (FISHM), then leverages a physics engine to enable robust material perception and manipulation under extreme, vision-constrained conditions.
  • Figure 2: Overview of the proposed framework. TouchFormer receives noisy or incomplete multimodal sequences as input and employs Modality-Adaptive Gating (MAG) to dynamically assess the quality of each modality. It then adaptively integrates cross-modal features within the latent space using both intra- and inter-modality attention mechanisms, without requiring explicit alignment. Finally, Cross-Instance Embedding Regularization (CER) is applied to improve the clarity and discriminability of the representation space, thereby facilitating robust surface material classification.
  • Figure 3: The FISHM dataset comprises seven distinct categories of daily objects, encompassing representative item types potentially encountered in fire incidents.
  • Figure 4: Confusion matrices for TouchFormer on the SSMC task. (a) Coarse-grained classification results across 8 material classes. (b) Fine-grained classification results, using category C6 as an example.
  • Figure 5: Visualization of subclass embeddings with and without CER. t-SNE plots of feature embeddings for coarse- and fine-grained classification.
  • ...and 2 more figures