Visual-auditory Extrinsic Contact Estimation

Xili Yi, Jayjun Lee, Nima Fazeli

TL;DR

The paper tackles the challenge of estimating extrinsic contacts between a grasped object and its environment under occlusions by fusing vision with active audio sensing. It introduces VA2Contact, a multimodal model built on a three-stream UNet that processes depth, optical flow, and a log-mel spectrogram derived from a 1s sweep, with proprioception fused at the bottleneck to produce per-pixel contact probability maps. A real-to-sim audio hallucination strategy injects real-world audio into simulated data to enable zero-shot sim-to-real transfer, achieving accurate contact location and patch geometry across cluttered scenes and improving policy learning for contact-rich manipulation tasks. The approach demonstrates robust performance under occlusions, outperforms vision-only baselines, and enhances a wiping task when contact cues are explicitly incorporated into the policy, underscoring the practical significance of multimodal extrinsic contact perception in manipulation.

Abstract

Robust manipulation often hinges on a robot's ability to perceive extrinsic contacts, i.e., contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique, injecting real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks.
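
As a rough illustration of the active audio sensing front end (emit a sweep, then turn the received waveform into a spectrogram), the following is a minimal NumPy sketch. The sample rate, frequency range, window length, and hop size are assumptions chosen for illustration, not values from the paper, and the function names are hypothetical. The sketch computes a plain log-magnitude STFT; a mel filterbank, which a log-mel spectrogram would add on top, is omitted to keep the example dependency-free.

```python
import numpy as np

def linear_sweep(duration_s=1.0, fs=16000, f0=100.0, f1=4000.0):
    """Linear chirp from f0 to f1 Hz. All default values are illustrative assumptions."""
    t = np.arange(int(duration_s * fs)) / fs
    # Instantaneous phase of a linear chirp: 2*pi*(f0*t + (f1 - f0) / (2*T) * t^2)
    phase = 2.0 * np.pi * (f0 * t + 0.5 * (f1 - f0) / duration_s * t**2)
    return np.sin(phase)

def log_spectrogram(x, n_fft=512, hop=128, eps=1e-10):
    """Log-magnitude STFT with a Hann window. Returns an array of shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + eps).T

# Stand-in for the received signal: here we simply reuse the emitted sweep.
sweep = linear_sweep()
spec = log_spectrogram(sweep)
print(spec.shape)
```

In a real setup the spectrogram would be computed from the contact-microphone recording rather than from the emitted sweep; the sweep is reused here only so that the example runs on its own.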

Paper Structure (8 sections, 7 figures, 2 tables)

This paper contains 8 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: (a) Our proposed fingers with an active conduction speaker and contact microphone emitting and receiving sound through the object. The absorption and reflection of the audio at the contact between the object and environment enable extrinsic contact estimation despite visual ambiguities. Challenges include (b) objects occluding the contact between the box and the table, (c) different surface types changing the acoustic feedback, and (d) estimating the object's contact status in near-contact scenarios.
  • Figure 2: System Architecture. Visual-Auditory Extrinsic Contact Estimation (VA2Contact) is trained in simulation with real-world audio collected through our active-audio sensing mechanism. Raw audio waveforms are processed with the Short-Time Fourier Transform (STFT). VA2Contact can zero-shot transfer to the real world for contact prediction tasks. Here, the output contact probability map is overlaid onto the full depth image of the scene. Note the use of an off-the-shelf metric depth estimation model (Depth-Pro [bochkovskii2024depth]) and an optical flow estimation model (RAFT [teed2020raft]), both for scalable sim-based training and for real-world inference, to bridge the sim-to-real gap effectively. VA2Contact unlocks contact perception under occlusions for contact-rich tool manipulation tasks such as wiping, which we demonstrate through real-world policy learning experiments.
  • Figure 3: Active-audio Sensing. (1) A sweeping acoustic signal is generated and (2) emitted by the conduction speaker finger; (3) the sound propagates through the object, which vibrates together with any extrinsic contact it makes; (4) the signal is received at the contact microphone finger; and (5) the audio waveform is converted to a spectrogram.
  • Figure 4: Different contact modes learned from simulation data. The cropped depth is centered around the projected pixel coordinate of the EE pose. Sample predictions from test simulation data are shown as contact probability maps overlaid on top of scene depth.
  • Figure 5: Sim-to-Real Transfer and Real-world Extrinsic Contact Predictions for S1. The RGB-D images with GT contact probability masks are framed in red. Three models, VA2Contact, VA2Contact w/o optical flow, and Im2Contact, are tested, with results shown per column. VA2Contact's contact probability predictions are overlaid on depth images, framed in green. The color bar represents contact probability from 0.0 to 1.0. All grasped objects used for real-world testing are unseen geometries (a cup, pear, dustpan, box, blue sponge, lemon, can, clamp). VA2Contact is able to zero-shot predict diverse contact types over objects with varying properties.
  • ...and 2 more figures
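
The Figure 4 caption mentions a depth crop centered on the projected pixel coordinate of the end-effector (EE) pose. A minimal sketch of that step is given below, assuming a pinhole camera model, an EE position already expressed in the camera frame, and a 64 x 64 crop; the intrinsics, crop size, and function names are hypothetical and not taken from the paper.

```python
import numpy as np

def project_point(p_cam, K):
    """Pinhole projection of a 3D point (camera frame, metres) to integer pixel (u, v)."""
    x, y, z = p_cam
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return int(round(u)), int(round(v))

def crop_around(depth, center_uv, size=64):
    """Crop a size x size window centered at (u, v), zero-padding at image borders."""
    h, w = depth.shape
    # Clamp the center to the image so the crop is always well defined
    u = int(np.clip(center_uv[0], 0, w - 1))
    v = int(np.clip(center_uv[1], 0, h - 1))
    half = size // 2
    padded = np.pad(depth, half, mode="constant")
    # After padding by `half`, original pixel (v, u) sits at (v + half, u + half),
    # so a window starting at (v, u) in the padded image is centered on it.
    return padded[v : v + size, u : u + size]

# Illustrative values only (assumed intrinsics and EE position).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.arange(480 * 640, dtype=float).reshape(480, 640)
ee_uv = project_point((0.1, -0.05, 0.5), K)
crop = crop_around(depth, ee_uv)
print(ee_uv, crop.shape)
```

Zero-padding keeps the crop size fixed even when the EE projects near the image border, which is convenient when the crop feeds a network with a fixed input resolution.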