TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception
Kailin Lyu, Long Xiao, Jianing Zeng, Junhao Dong, Xuexin Liu, Zhuojun Zou, Haoyue Yang, Lin Shu, Jie Hao
TL;DR
TouchFormer tackles the brittleness of vision-free material perception with three core ideas: Modality-Adaptive Gating (MAG) to filter and weight modalities, intra- and inter-modal Transformer fusion to integrate asynchronous multimodal signals, and Cross-Instance Embedding Regularization (CER) to sharpen the embedding space. The framework jointly optimizes these components with a combined loss $\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{C}$, achieving consistent improvements over non-visual baselines and remaining robust under modality noise. The approach is validated on the LMTHM and Fire Incident Sound and Haptic Material (FISHM) datasets, and its real-world robotic applicability is demonstrated in fire-like, vision-constrained environments. TouchFormer improves material classification accuracy by up to $6.83\%$ on USMC and maintains strong performance without vision, underscoring its practical potential for safety-critical robotics and autonomous systems.
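The combined objective $\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{C}$ can be illustrated with a minimal sketch. This summary does not give the form of $\mathcal{L}_{C}$ or the value of $\lambda$, so the example makes several assumptions: softmax cross-entropy stands in for $\mathcal{L}_{cls}$, the regularization term is passed in as an already computed scalar (`reg_value`), and `lam=0.1` is an arbitrary placeholder. The function names are hypothetical and are not taken from the authors' code.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (assumed form of L_cls)."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(logits, label, reg_value, lam=0.1):
    """L_total = L_cls + lambda * L_C.

    reg_value is a placeholder for L_C, whose definition is not given here;
    lam is a placeholder weight, not a value reported by the authors.
    """
    return cross_entropy(logits, label) + lam * reg_value
```

With `lam=0.0` the objective reduces to the classification loss alone, which is a convenient way to ablate the regularizer in a sketch like this one.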
Abstract
Traditional vision-based material perception methods often degrade substantially when vision is impaired, motivating a shift toward non-visual multimodal material perception. However, existing approaches frequently fuse multimodal inputs naively, overlooking key challenges such as modality-specific noise, the missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose TouchFormer, a robust multimodal fusion framework. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism together with intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization (CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, TouchFormer improves classification accuracy by 2.48% and 6.83% on the SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer's effectiveness in enabling robots to better perceive and interpret their environment, paving the way for deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-sourced, and videos are available in the supplementary materials.
