Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta

TL;DR

This work addresses the scarcity of large-scale tactile pretraining data by using contact microphones as tactile sensors. Because contact microphones record touch as audio, the authors can reuse large-scale audio-visual pretraining. They initialize an audio encoder with AVID pretraining on AudioSet and a visual encoder with R3M, then train a multisensory policy via behavior cloning that fuses the two modalities with a transformer. Across three real-world tasks in a low-data regime, the approach improves over vision-only baselines and outperforms variants whose audio encoder is trained from scratch, and it generalizes better to novel visual conditions. The study highlights the potential of cross-domain multisensory pretraining to improve robotic manipulation when tactile data is scarce, and points to future work on richer visuo-tactile integration and dataset design.

Abstract

Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach leveraging large-scale multisensory pre-training for robotic manipulation. For supplementary information including videos of real robot experiments, please see https://sites.google.com/view/hearing-touch.

Paper Structure (28 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Hearing touch: We enable multisensory pretraining for manipulation by transferring audio-visual representations to manipulation tasks using vision and contact audio.
  • Figure 2: Two-stage model training. AVID and R3M pretraining leverages the large scale of internet video data (blue dashed box). We initialize the vision and audio encoders with the resulting pre-trained representations and then train the entire policy end-to-end with behavior cloning from a small number of in-domain demonstrations. The policy takes image and spectrogram inputs (left) and outputs a sequence of actions in delta end effector space (right).
  • Figure 3: Hardware and task setup. We attach the Piezo contact microphones to our gripper to record vibrations in the form of audio and run experiments on three real-world tasks with significant visual differences between train and test settings.
  • Figure 4: Success rates across methods and tasks. Our method, shown in blue, outperforms baselines in all but one setup of the zipping task. Furthermore, our method displays much less variation in performance between different configurations of each task, showcasing an increase in the ability to generalize to drastic visual differences as a result of learning useful audio representations.
  • Figure 5: t-SNE 2D projection. For comparative analysis of the learned embedding spaces, we visualize projections of the learned representations from each method in each variation of the flipping task. Lighter hues indicate the starting points and darker hues indicate the end points of the trajectories. Please see the video on our website (https://sites.google.com/view/hearing-touch) for a better visualization.
  • ...and 1 more figure
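
The policy structure described in the Figure 2 caption (two pretrained encoders, fusion, an action-sequence output, and a behavior cloning objective) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes the vision and audio encoders have already produced fixed-size embeddings (random placeholders stand in for R3M and AVID outputs), replaces the transformer with a single self-attention step that has no learned projections, and uses a mean-squared-error loss as one common choice for behavior cloning. The function names, the embedding size, the horizon, and the 4-dimensional delta action are all hypothetical values chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real transformer layer learns these.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return attended


def fuse_and_predict(vision_emb, audio_emb, head_weights, horizon, action_dim):
    """Fuse one vision token and one audio token, then predict actions.

    Returns `horizon` actions, each a list of `action_dim` deltas.
    """
    attended = self_attention([vision_emb, audio_emb])
    dim = len(vision_emb)
    # Mean-pool the attended tokens into a single fused feature vector.
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]
    # Linear head (no bias) mapping the fused feature to a flat action sequence.
    flat = [sum(w * x for w, x in zip(row, pooled)) for row in head_weights]
    return [flat[t * action_dim:(t + 1) * action_dim] for t in range(horizon)]


def bc_loss(predicted, demonstrated):
    """Mean squared error between predicted and demonstrated action sequences."""
    errors = [
        (p - d) ** 2
        for pred_a, demo_a in zip(predicted, demonstrated)
        for p, d in zip(pred_a, demo_a)
    ]
    return sum(errors) / len(errors)


# Hypothetical sizes; the real encoders and action space differ.
DIM, HORIZON, ACTION_DIM = 8, 3, 4
rng = random.Random(0)
vision_emb = [rng.gauss(0, 1) for _ in range(DIM)]  # placeholder for a visual embedding
audio_emb = [rng.gauss(0, 1) for _ in range(DIM)]   # placeholder for an audio embedding
head_weights = [
    [rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(HORIZON * ACTION_DIM)
]

actions = fuse_and_predict(vision_emb, audio_emb, head_weights, HORIZON, ACTION_DIM)
demo_actions = [[0.0] * ACTION_DIM for _ in range(HORIZON)]  # stand-in demonstration
loss = bc_loss(actions, demo_actions)
```

In an actual training loop, the gradient of `loss` would update the encoders, the fusion module, and the action head together, which is what the caption means by training the entire policy end-to-end from a small number of demonstrations.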