Visual-auditory Extrinsic Contact Estimation
Xili Yi, Jayjun Lee, Nima Fazeli
TL;DR
The paper tackles the challenge of estimating extrinsic contacts between a grasped object and its environment under occlusion by fusing vision with active audio sensing. It introduces VA2Contact, a multimodal model built on a three-stream UNet that processes depth, optical flow, and a log-mel spectrogram derived from a 1 s sweep, with proprioception fused at the bottleneck, to produce per-pixel contact probability maps. A real-to-sim audio hallucination strategy injects real-world audio into simulated data to enable zero-shot sim-to-real transfer. The resulting model estimates contact location and patch geometry accurately across cluttered scenes, remains robust under occlusion, and outperforms vision-only baselines. Explicitly incorporating the predicted contact cues into the policy also improves learning on contact-rich manipulation tasks, including a wiping task, which underscores the practical value of multimodal extrinsic contact perception.
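To make the idea of a per-pixel contact probability map concrete, the following minimal sketch reduces such a map to a contact location and a patch size. It is an illustration only and not the paper's post-processing: the function name `summarize_contact_map`, the 0.5 threshold, and the probability-weighted centroid are all assumptions introduced here.

```python
from typing import Optional, Sequence, Tuple


def summarize_contact_map(
    prob_map: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Optional[Tuple[float, float, int]]:
    """Return (centroid_row, centroid_col, area_px), or None if no pixel passes.

    Pixels with probability >= threshold count as contact. The centroid is
    weighted by probability; the area is the number of contact pixels.
    """
    row_sum = 0.0
    col_sum = 0.0
    weight = 0.0
    area = 0
    for r, row in enumerate(prob_map):
        for c, p in enumerate(row):
            if p >= threshold:
                row_sum += r * p
                col_sum += c * p
                weight += p
                area += 1
    # No contact pixels, or only zero-probability pixels with threshold <= 0.
    if area == 0 or weight == 0.0:
        return None
    return row_sum / weight, col_sum / weight, area


# Hypothetical 3x3 map with a two-pixel contact patch in the middle row.
example_map = [
    [0.0, 0.0, 0.0],
    [0.0, 0.9, 0.9],
    [0.0, 0.0, 0.0],
]
summary = summarize_contact_map(example_map)
```

In this sketch the patch size is reported in pixels; converting it to metric units would additionally require the depth image and camera intrinsics, which are outside the scope of the example.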
Abstract
Robust manipulation often hinges on a robot's ability to perceive extrinsic contacts, that is, contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique that injects real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks.
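The real-to-sim audio hallucination idea can be sketched as a data-pairing step: each simulated scene keeps its ground-truth contact label and is assigned a real recording. The sketch below is a simplified illustration under stated assumptions, not the authors' pipeline. In particular, matching clips on a binary contact state, the `SimSample` type, and the `hallucinate_audio` function are hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class SimSample:
    scene_id: str     # placeholder handle for the simulated visual inputs
    in_contact: bool  # ground-truth contact label from the simulator


def hallucinate_audio(
    sim_samples: Sequence[SimSample],
    real_clips: Dict[bool, List[str]],  # contact state -> real clip identifiers
    rng: random.Random,
) -> List[Tuple[SimSample, str]]:
    """Pair every simulated sample with a real clip of the same contact state.

    Assumption: real clips were recorded and bucketed by whether the grasped
    object was touching the environment.
    """
    paired = []
    for sample in sim_samples:
        pool = real_clips.get(sample.in_contact, [])
        if not pool:
            raise ValueError(f"no real clips for in_contact={sample.in_contact}")
        paired.append((sample, rng.choice(pool)))
    return paired


# Hypothetical clip names and scenes, for illustration only.
clips = {True: ["contact_000.wav", "contact_001.wav"], False: ["free_000.wav"]}
scenes = [SimSample("sim_a", True), SimSample("sim_b", False), SimSample("sim_c", True)]
pairs = hallucinate_audio(scenes, clips, random.Random(0))
```

Sampling a fresh clip per simulated scene, rather than fixing one clip per label, exposes the model to the variability of real recordings while the labels still come from simulation.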
