ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly

Minchi Ruan, LiangQing Zhou, Hongtong Li, Zongtao Wang, ZhaoMing Lu, Jianwei Zhang, Bin Fang

TL;DR

ReTac-ACT (Reconstruction-enhanced Tactile ACT) is a vision-tactile imitation learning policy for precision assembly, where visual feedback fails under occlusion during the final contact-rich phase. Combining three synergistic mechanisms, it achieves 90% peg-in-hole success on the NIST Assembly Task Board M1 benchmark, substantially outperforming vision-only and generalist baselines, and maintains 80% success at industrial-grade 0.1 mm clearance.

Abstract

Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective enforcing learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at industrial-grade 0.1 mm clearance. Ablation studies validate that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering various clearance levels with both visual and tactile features will be released to support reproducible research.
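
The gating idea in mechanism (ii) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the gate is conditioned on proprioception and shifts reliance toward touch under occlusion. The linear gate, the two-way softmax, the toy weights, the single "insertion depth" input, and the names `gate_weights` and `fuse` are all assumptions made here for illustration.

```python
import math

def gate_weights(proprio, W, b):
    """Map a proprioceptive state to (visual, tactile) weights that sum to 1.

    Hypothetical gate: one linear layer (W has one row per modality)
    followed by a softmax over the two modality logits.
    """
    logits = [sum(w * p for w, p in zip(row, proprio)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(visual_feat, tactile_feat, weights):
    """Convex combination of two equally sized feature vectors."""
    g_v, g_t = weights
    return [g_v * v + g_t * t for v, t in zip(visual_feat, tactile_feat)]

# Toy setup (assumed): one proprioceptive feature, a normalized insertion
# depth in [0, 1]. The made-up weights push the gate toward touch as the
# peg goes deeper, which is where occlusion would be expected.
W = [[-4.0], [4.0]]
b = [2.0, -2.0]

far = gate_weights([0.0], W, b)    # before contact: mostly visual
near = gate_weights([1.0], W, b)   # deep in the hole: mostly tactile
fused = fuse([1.0, 0.0], [0.0, 1.0], near)
```

In a learned policy the gate parameters would be trained end to end with the rest of the network; the fixed values above only show how the weighting shifts with the robot's state.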

Paper Structure (23 sections, 6 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: We present ReTac-ACT, a state-gated vision-tactile policy that extends Action Chunking with Transformers (ACT) to natively process tactile feedback. ReTac-ACT sets a new state of the art for high-precision peg-in-hole tasks on the NIST ATB M1 benchmark provided by ManipulationNet, achieving 90% success at 3 mm clearance and maintaining 80% at industrial-grade 0.1 mm clearance where pure vision fails due to occlusion. It features a proprioception-conditioned gating mechanism to dynamically fuse visual and tactile modalities and is trained with auxiliary tactile reconstruction objectives. The ReTac-ACT code will be made open-source to support the research community.
  • Figure 2: Overview of the ReTac-ACT architecture. (a) Multi-Modal Encoders: Visual inputs (3 RGB cameras) and tactile inputs (4 contact images: one sensor per fingertip, two fingers per gripper, bimanual) are processed by separate backbones into feature tokens. (b) Cross-Modal Dynamic Fusion: A proprioception-based gating network dynamically weighs modalities, enhanced by bidirectional cross-attention. (c) Action Generator: A CVAE-based transformer decoder predicts temporal action chunks $\hat{a}_{t:t+k-1}$, where each action includes 14-DoF bimanual joint targets and 2 gripper commands.
  • Figure 3: Tactile representation learning via auxiliary reconstruction. During training, an image reconstruction objective regularizes the tactile encoder. By employing a decoder to reconstruct the raw tactile inputs from the learned latent tokens, the model is explicitly forced to capture fine-grained contact geometry, preventing feature collapse.
  • Figure 4: Hardware setup for the bimanual vision-tactile precision-assembly system.
  • Figure 5: Robustness to tighter clearances. As clearance tightens from 3 mm to 0.1 mm, ReTac-ACT's success rate drops by only 11% in relative terms (90%$\to$80%), while ACT's drops by 62.5% (40%$\to$15%) and Diffusion Policy (DP) fails entirely (20%$\to$0%). Dashed lines highlight the degradation trends.
  • ...and 1 more figure
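
The degradation percentages in the Figure 5 caption are relative drops, not percentage-point differences. The short check below reproduces them from the success rates quoted in that caption; the helper name `relative_drop` is introduced here for illustration only.

```python
def relative_drop(before, after):
    """Relative decrease, as a percentage of the starting value."""
    if before == 0:
        raise ValueError("relative drop is undefined for a zero starting value")
    return 100.0 * (before - after) / before

# Success rates (%) at 3 mm and 0.1 mm clearance, from the Figure 5 caption.
rates = {"ReTac-ACT": (90, 80), "ACT": (40, 15), "DP": (20, 0)}

for name, (wide, tight) in rates.items():
    points = wide - tight
    print(f"{name}: {points} points absolute, "
          f"{relative_drop(wide, tight):.1f}% relative")
```

In absolute terms ReTac-ACT loses 10 percentage points against 25 for ACT, so the gap between the two methods holds under either reading.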