TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

Kaidi Zhang, Heng Zhang, Zhengtong Xu, Zhiyuan Zhang, Md Rakibul Islam Prince, Xiang Li, Xiaojing Han, Yuhao Zhou, Arash Ajoudani, Yu She

Abstract

Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rate by an average of 20% in disassembly and by 60% in in-box picking, with a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
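
The contact-aware gating idea can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold rule in `contact_detected`, the function names, the zero-and-mask gating, and the token layout are all assumptions chosen for illustration. The abstract states only that tactile tokens are activated when contact is detected and then processed jointly with the visual and language tokens.

```python
from typing import List, Sequence, Tuple

Token = List[float]


def contact_detected(tactile_reading: Sequence[float], threshold: float = 0.5) -> bool:
    # Hypothetical contact rule (an assumption, not from the paper):
    # contact is declared when the peak absolute reading exceeds the threshold.
    return max((abs(v) for v in tactile_reading), default=0.0) > threshold


def fuse_tokens(
    visual: List[Token],
    language: List[Token],
    tactile: List[Token],
    in_contact: bool,
) -> Tuple[List[Token], List[int]]:
    # Tactile tokens always keep their slots so the sequence length stays fixed.
    # Without contact they are zeroed and flagged 0 in the attention mask,
    # so a downstream transformer could ignore them.
    gate = 1 if in_contact else 0
    gated_tactile = [[gate * x for x in tok] for tok in tactile]
    tokens = visual + language + gated_tactile
    mask = [1] * (len(visual) + len(language)) + [gate] * len(tactile)
    return tokens, mask


# Example: one tactile token, gated by a reading above the assumed threshold.
tokens, mask = fuse_tokens(
    visual=[[0.1, 0.2], [0.3, 0.4]],
    language=[[0.5, 0.6]],
    tactile=[[0.7, 0.8]],
    in_contact=contact_detected([0.0, 0.9]),
)
```

Keeping the tactile slots in place and masking them, rather than dropping them, is one possible design choice: it keeps the input shape constant between the free-space and in-contact phases.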

Paper Structure (28 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Overview of TacVLA. (a) Input modalities including visual observations, language instructions, and tactile measurements. (b) TacVLA architecture, consisting of modality-specific encoders and tokenizer, a pretrained VLM backbone, an action expert, and the contact-aware gating module. (c) The proposed contact-aware gating module that selectively activates tactile tokens based on the contact state, enabling adaptive multimodal fusion during contact-rich manipulation. (d) Experimental evaluation on contact-rich constraint-locked disassembly and in-box picking tasks, together with robustness tests under camera occlusion and human disturbance.
  • Figure 2: Hardware setup: we utilize a 7-DoF Franka robotic platform equipped with tactile sensors and two cameras for visual input to evaluate our TacVLA model on contact-rich manipulation tasks.
  • Figure 3: Four contact-rich constraint-locked disassembly tasks with diverse geometric constraints: (a) Task 1: tight shaft; (b) Task 2: press clip; (c) Task 3: shaft rotation; (d) Task 4: slide pull.
  • Figure 4: The real-world experimental setup and procedures for the four constraint-locked disassembly tasks and the in-box picking task. The experiments demonstrate the capability of TacVLA in contact-rich fine-grained manipulation, as well as its robustness to visual occlusion in the in-box picking scenario.
  • Figure 5: Robustness Evaluation: we evaluate the performance of our TacVLA model under conditions of visual occlusion and runtime disturbance to demonstrate its ability to adapt and maintain performance in challenging scenarios.
  • ...and 2 more figures