See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

Hao Li, Yizhi Zhang, Junzhe Zhu, Shaoxiong Wang, Michelle A Lee, Huazhe Xu, Edward Adelson, Li Fei-Fei, Ruohan Gao, Jiajun Wu

TL;DR

This work demonstrates that integrating vision, audio, and tactile sensing through a multisensory self-attention model (MulSA) improves robotic manipulation. By collecting synchronized visual, acoustic, and GelSight tactile data and applying attention across both modalities and time steps, the approach outperforms baselines on dense packing and pouring tasks. The study finds that each modality plays a distinct role (vision for global context, audio for instantaneous events, and touch for local geometry), which highlights the value of tri-modal fusion. The findings suggest reinforcement learning and more versatile hardware as directions for generalizing across tasks.

Abstract

Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one or occasionally two modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused by a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but often suffers from occlusion, audio provides immediate feedback on key moments, even ones that cannot be seen, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
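
To make the fusion idea concrete, the following is a minimal sketch of self-attention over a sequence of per-modality, per-time-step tokens, followed by the kind of per-modality aggregation of attention scores that Figure 5 visualizes. It is not the authors' implementation. The token values are random stand-ins for encoder outputs, the queries, keys, and values are the tokens themselves (no learned projections), and the names `make_tokens`, `self_attention`, and `attention_per_modality` are hypothetical.

```python
import math
import random

MODALITIES = ("vision", "audio", "touch")  # the three senses named in the abstract


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    Returns (fused_tokens, attention_weights).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights.append(softmax(scores))
    # Each fused token is the attention-weighted sum of all tokens.
    fused = [
        [sum(w * value[j] for w, value in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return fused, weights


def attention_per_modality(weights, labels):
    """Average attention mass that queries place on each modality's tokens."""
    totals = {name: 0.0 for name in MODALITIES}
    for row in weights:
        for w, label in zip(row, labels):
            totals[label] += w
    return {name: total / len(weights) for name, total in totals.items()}


def make_tokens(num_steps=2, dim=4, seed=0):
    """Hypothetical stand-in for encoder outputs: one token per modality per time step."""
    rng = random.Random(seed)
    tokens, labels = [], []
    for _ in range(num_steps):
        for name in MODALITIES:
            tokens.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
            labels.append(name)
    return tokens, labels


tokens, labels = make_tokens()
fused, weights = self_attention(tokens)
shares = attention_per_modality(weights, labels)
print(shares)  # fraction of attention landing on each modality; sums to 1
```

Because every token attends to every other token, a single attention pass mixes information across modalities and across time steps at once. With real encoder features, the per-modality shares would shift over the course of a trial as different senses become informative.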

Paper Structure (18 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of the real-world dense packing task that we tackle, where the robot needs to insert a glass into the densely cluttered box by leveraging multisensory feedback.
  • Figure 2: Our multisensory robot learning framework. The visual, acoustic, and tactile data from the corresponding sensors (left) are processed and fused by our multisensory self-attention model (middle) to predict a task-specific action for the dense packing task and the pouring task (right).
  • Figure 3: Multisensory self-attention.
  • Figure 4: Illustration of our task setup for the dense packing task and the pouring task.
  • Figure 5: Visualization of the aggregated attention scores for each modality as the robot completes (a) the dense packing task and (b) the pouring task in two test trials.