Vi-TacMan: Articulated Object Manipulation via Vision and Touch
Leiyao Cui, Zihang Zhao, Sirui Xie, Wenhuan Zhang, Zhi Han, Yixin Zhu
TL;DR
Vi-TacMan addresses articulated-object manipulation in unstructured environments by pairing vision-based coarse guidance with tactile, contact-regulated execution. The framework uses surface normals as geometric priors and models directional uncertainty on the unit sphere with a von Mises-Fisher distribution to robustly infer interaction directions without explicit kinematic models. A detection module achieves $0.86$ mAP and, together with a PointNet++ displacement estimator and GelSight-based tactile policy, enables a complete vision-to-touch manipulation pipeline. Evaluations on over $5\times 10^4$ simulations and diverse real objects show strong cross-category generalization and statistically significant improvements ($p<0.0001$).
Abstract
Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but can yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This suggests a natural synergy: vision for global guidance, touch for local precision. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all $p<0.0001$). Critically, manipulation succeeds without explicit kinematic models: the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on more than 50,000 simulated and diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.
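To make the von Mises-Fisher (vMF) modeling of directions concrete, the following is a minimal sketch of how candidate interaction directions could be scored against a vMF distribution centered on a surface normal. It is an illustration only, not the Vi-TacMan implementation: the function name `vmf_log_pdf`, the concentration value, and the candidate directions are all assumptions chosen for the example. It uses the standard vMF density on the unit sphere in three dimensions, $f(x;\mu,\kappa)=\frac{\kappa}{4\pi\sinh\kappa}\exp(\kappa\,\mu^\top x)$.

```python
import numpy as np


def vmf_log_pdf(x, mu, kappa):
    """Log-density of a von Mises-Fisher distribution on the unit sphere S^2.

    x:     array of shape (..., 3), unit direction(s) to score.
    mu:    array of shape (3,), mean direction (normalized here for safety).
    kappa: concentration > 0; larger values mean less directional uncertainty.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    mu = mu / np.linalg.norm(mu)
    # Numerically stable form of log(kappa / (4*pi*sinh(kappa))):
    # log(kappa) - log(2*pi) - kappa - log(1 - exp(-2*kappa)).
    log_norm = (
        np.log(kappa)
        - np.log(2.0 * np.pi)
        - kappa
        - np.log1p(-np.exp(-2.0 * kappa))
    )
    return log_norm + kappa * (x @ mu)


# Hypothetical inputs: a surface normal used as the prior mean direction,
# an assumed concentration, and three candidate unit directions.
surface_normal = np.array([0.0, 0.0, 1.0])
kappa = 20.0  # assumed value, not taken from the paper
candidates = np.array([
    [0.0, 0.0, 1.0],                  # aligned with the normal
    [0.0, np.sin(0.2), np.cos(0.2)],  # tilted by 0.2 rad
    [1.0, 0.0, 0.0],                  # perpendicular to the normal
])

scores = vmf_log_pdf(candidates, surface_normal, kappa)
best_direction = candidates[np.argmax(scores)]
```

In this sketch, a larger `kappa` concentrates probability mass around the surface normal, so tilted candidates are penalized more strongly; a smaller `kappa` expresses greater directional uncertainty and flattens the ranking.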
