ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation

Hongyu Li, James Akl, Srinath Sridhar, Tye Brady, Taskin Padir

TL;DR

ViTa-Zero tackles zero-shot visuotactile 6D object pose estimation by combining a visual backbone with physics-inspired feasibility checks and a test-time spring–mass refinement that uses tactile and proprioceptive cues. The method maintains a pose $T=(R,\mathbf{t})\in SE(3)$ estimated from vision and, when that estimate is physically infeasible, refines it with a proximal optimization that balances tactile attraction against robot-penetration penalties. In real-world experiments across multiple backbones and tasks, ViTa-Zero improves substantially on purely visual baselines (on average +55% AUC of ADD-S, +60% ADD, and 80% lower position error (PE) relative to FoundationPose) without requiring tactile data collection for fine-tuning. The results suggest the approach is practical for robust manipulation under occlusion and contact, and point toward tactile-augmented pose estimation that generalizes beyond sensor-specific setups.

Abstract

Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
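
The spring-mass idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes the object is a sphere of known radius, refines only the translation (the sphere center) rather than a full $SE(3)$ pose, and uses plain gradient descent. The function names (`spring_gradient`, `refine_center`), the stiffness values `k_att` and `k_rep`, and the step size are all hypothetical. Active tactile contacts act as attractive springs that pull the object surface onto the contact points, while robot points derived from proprioception act as repulsive springs only when they penetrate the object.

```python
import numpy as np

def spring_gradient(t, radius, contacts, robot_pts, k_att=1.0, k_rep=1.0):
    """Gradient of the total spring energy with respect to the sphere center t."""
    grad = np.zeros(3)
    if len(contacts):
        diff = contacts - t
        dist = np.linalg.norm(diff, axis=1, keepdims=True)
        unit = diff / np.maximum(dist, 1e-9)
        # Attractive springs: E = 0.5 * k_att * (dist - radius)^2,
        # zero when the contact lies exactly on the sphere surface.
        grad += (-k_att * (dist - radius) * unit).sum(axis=0)
    if len(robot_pts):
        diff = robot_pts - t
        dist = np.linalg.norm(diff, axis=1, keepdims=True)
        unit = diff / np.maximum(dist, 1e-9)
        # Repulsive springs: E = 0.5 * k_rep * depth^2,
        # active only while a robot point is inside the sphere.
        depth = np.maximum(radius - dist, 0.0)
        grad += (k_rep * depth * unit).sum(axis=0)
    return grad

def refine_center(t0, radius, contacts, robot_pts, lr=0.2, steps=200):
    """Gradient descent on the spring energy, starting from the visual estimate t0."""
    t = np.asarray(t0, dtype=float).copy()
    contacts = np.asarray(contacts, dtype=float).reshape(-1, 3)
    robot_pts = np.asarray(robot_pts, dtype=float).reshape(-1, 3)
    for _ in range(steps):
        t -= lr * spring_gradient(t, radius, contacts, robot_pts)
    return t

# Hypothetical scene: a 5 cm sphere truly centered at the origin, four
# fingertip contacts on its surface, and one finger-link point nearby.
contacts = [[0.05, 0, 0], [-0.05, 0, 0], [0, 0.05, 0], [0, -0.05, 0]]
robot_pts = [[0.07, 0, 0]]
t_visual = [0.03, 0.01, 0.0]  # drifted visual estimate (penetrates the finger link)
t_refined = refine_center(t_visual, 0.05, contacts, robot_pts)
```

In this toy scene the drifted estimate places the sphere over the finger-link point, so both spring types are active at the start, and the refined center moves back toward the origin. A real object would need a mesh or signed-distance representation in place of the sphere, and the rotation would have to be refined as well.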

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: FoundationPose (Wen et al., 2024) (left) fails due to errors that are acute (e.g., occlusions) or accumulative (e.g., noise) while tracking the in-hand object. Our approach (right) leverages tactile and proprioceptive observations for stable tracking.
  • Figure 2: Overview of ViTa-Zero. Red fingers in the robot hand model represent activated fingertip tactile sensors.
  • Figure 3: Qualitative results. We demonstrate the performance during the object picking and bimanual handover tasks with "camera" and "eyedrop" objects. During these manipulation tasks, it is common to encounter scenarios where the object is highly occluded while moving, as illustrated in the figure. Visual approaches, like FoundationPose, can lose tracking and fail due to the absence of visual information. In contrast, our method utilizes additional tactile and proprioceptive feedback to maintain object tracking, ensuring robust performance. We note that the wrist cameras were not used in this study.
  • Figure 4: Our robot platform. Our setup consists of two Universal Robots UR5e arms and PSYONIC Ability hands.
  • Figure 5: Comparison with FoundationPose (FP) and MegaPose (MP). The performance is measured by the AUC of ADD-S and ADD (the higher, the better) and position error (PE) (the lower, the better).
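
For readers unfamiliar with the metrics named in the Figure 5 caption, the sketch below shows how ADD, ADD-S, their AUC, and the position error are commonly computed. It follows the standard definitions rather than the authors' evaluation code: the function names, the 0.1 m maximum threshold, and the number of threshold samples are assumptions. ADD averages the distance between corresponding model points under the ground-truth and predicted poses; ADD-S uses the closest predicted point instead, so it does not penalize poses that differ only by an object symmetry.

```python
import numpy as np

def transform(points, R, t):
    """Apply the rigid transform (R, t) to an (N, 3) array of model points."""
    return points @ R.T + t

def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return np.linalg.norm(gt - pred, axis=1).mean()

def adds_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()

def position_error(t_gt, t_pred):
    """Euclidean distance between ground-truth and predicted translations."""
    return np.linalg.norm(np.asarray(t_gt, dtype=float) - np.asarray(t_pred, dtype=float))

def auc(errors, max_threshold=0.1, n_thresholds=1000):
    """Normalized area under the accuracy-vs-threshold curve, in [0, 1]."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, n_thresholds)
    accuracy = np.array([(errors <= th).mean() for th in thresholds])
    # Trapezoidal integration written out explicitly.
    area = ((accuracy[1:] + accuracy[:-1]) / 2.0 * np.diff(thresholds)).sum()
    return area / max_threshold
```

The pairwise distance matrix in `adds_error` grows quadratically with the number of model points, so for dense meshes a nearest-neighbor structure such as a k-d tree would be the usual replacement.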