Vi-TacMan: Articulated Object Manipulation via Vision and Touch

Leiyao Cui, Zihang Zhao, Sirui Xie, Wenhuan Zhang, Zhi Han, Yixin Zhu

TL;DR

Vi-TacMan addresses articulated-object manipulation in unstructured environments by pairing vision-based coarse guidance with tactile, contact-regulated execution. The framework uses surface normals as geometric priors and models directional uncertainty on the unit sphere with a von Mises-Fisher distribution to robustly infer interaction directions without explicit kinematic models. A detection module achieves $0.86$ mAP and, together with a PointNet++ displacement estimator and GelSight-based tactile policy, enables a complete vision-to-touch manipulation pipeline. Evaluations on over $5\times 10^4$ simulations and diverse real objects show strong cross-category generalization and statistically significant improvements ($p<0.0001$).

Abstract

Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but can yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This suggests a natural synergy: vision for global guidance, touch for local precision. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all p<0.0001). Critically, manipulation succeeds without explicit kinematic models -- the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on more than 50,000 simulated and diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.
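To make the von Mises-Fisher modeling mentioned above concrete, the following minimal sketch fits a vMF distribution on the unit sphere to a handful of candidate directions and evaluates its log-density. It is an illustration under stated assumptions, not the paper's procedure: the function names `fit_vmf` and `vmf_log_pdf`, the sample directions, and the closed-form concentration estimate (the common approximation $\hat{\kappa} \approx \bar{r}(3-\bar{r}^2)/(1-\bar{r}^2)$ for three dimensions) are all chosen here for illustration.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def fit_vmf(directions):
    """Estimate the vMF mean direction and concentration from 3D vectors.

    Uses the widely used closed-form approximation for kappa; it is
    undefined when all directions coincide exactly (r_bar == 1).
    """
    units = [normalize(d) for d in directions]
    n = len(units)
    resultant = tuple(sum(u[i] for u in units) / n for i in range(3))
    r_bar = math.sqrt(sum(c * c for c in resultant))  # mean resultant length
    mu = tuple(c / r_bar for c in resultant)
    kappa = r_bar * (3.0 - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa

def vmf_log_pdf(x, mu, kappa):
    """Log-density of a unit vector x under vMF(mu, kappa) on the 2-sphere.

    The normalizer is kappa / (4 * pi * sinh(kappa)), written in a
    numerically stable form for large kappa.
    """
    dot = sum(a * b for a, b in zip(x, mu))
    log_norm = (math.log(kappa) - math.log(2.0 * math.pi) - kappa
                - math.log1p(-math.exp(-2.0 * kappa)))
    return log_norm + kappa * dot

# Hypothetical candidate directions scattered around +z (illustrative data).
candidates = [(0.1, 0.0, 1.0), (-0.1, 0.0, 1.0), (0.0, 0.1, 1.0), (0.0, -0.1, 1.0)]
mu, kappa = fit_vmf(candidates)
```

A larger `kappa` means the candidate directions agree closely, so the density concentrates around `mu`; a small `kappa` signals high directional uncertainty.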

Paper Structure

This paper contains 15 sections, 10 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Overview of Vi-TacMan. Vi-TacMan exploits the complementary strengths of vision and touch for manipulating unseen articulated objects. Vision provides global context to propose grasps and estimate coarse interaction directions, which initialize a tactile controller that leverages local contact feedback for precise and robust execution.
  • Figure 2: Inputs to the vision module of Vi-TacMan. The vision module of Vi-TacMan processes RGB-D data from a depth sensor, surface normals computed from the depth map (visualized as a normal map), and instance-level semantic masks identifying holdable and movable parts. This representation accommodates objects with multiple interactable components. Note: Holdable masks are subsets of their associated movable masks; regions appear overlapped in the visualization.
  • Figure 3: Coupling between grasp point and interaction direction. The interaction direction depends on the selected grasp point even when the same rigid transformation is applied. Different point selections yield different directions under identical transformations.
  • Figure 4: Real-world articulated objects and processing pipeline. (a) We evaluate Vi-TacMan on real-world objects spanning diverse configurations: prismatic to revolute joints, and single-part to multi-part structures. (b) Our trained detector reliably identifies movable and holdable parts, even in complex multi-part cases. (c) These detections provide prompts for the segmentation model, enabling fine-grained part segmentation. (d) Based on segmented parts, suitable grasps are generated at grasping points $\boldsymbol{g}$. These results provide the necessary information for inferring interaction directions.
  • Figure 5: Depth refinement using foundation models. We leverage a depth foundation model [yang2024depth] to refine raw depth measurements from the image sensor. Left: raw depth. Right: refined depth. Both visualizations use the same colorbar range for comparability.
  • ...and 5 more figures
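
The coupling described in the Figure 3 caption can be illustrated with a small numerical sketch. It assumes a hypothetical hinge aligned with the z-axis through the origin and a hypothetical drawer sliding along x; the helper names (`rotate_z`, `displacement_direction`) and the numeric values are illustrative and do not come from the paper. The same rigid rotation moves two grasp points in different directions, whereas a pure translation moves every point the same way.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis (through the origin) by angle in radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def displacement_direction(point, transform):
    """Unit vector from a point to its image under a rigid transformation."""
    moved = transform(point)
    delta = tuple(m - p for m, p in zip(moved, point))
    norm = math.sqrt(sum(d * d for d in delta))
    return tuple(d / norm for d in delta)

def hinge_motion(point):
    """Hypothetical revolute joint: a 5 degree rotation about the z-axis."""
    return rotate_z(point, math.radians(5.0))

def drawer_motion(point):
    """Hypothetical prismatic joint: a 5 cm translation along x."""
    return (point[0] + 0.05, point[1], point[2])

# Two candidate grasp points on the same movable part (illustrative values).
grasp_a = (1.0, 0.0, 0.0)
grasp_b = (0.0, 1.0, 0.0)

# Same rotation, different grasp points: the directions differ.
hinge_dir_a = displacement_direction(grasp_a, hinge_motion)
hinge_dir_b = displacement_direction(grasp_b, hinge_motion)

# Same translation, different grasp points: the directions coincide.
drawer_dir_a = displacement_direction(grasp_a, drawer_motion)
drawer_dir_b = displacement_direction(grasp_b, drawer_motion)
```

In this setup the two hinge directions are perpendicular to each other, which is why an interaction direction only makes sense once the grasp point has been fixed.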