The Audio-Visual BatVision Dataset for Research on Sight and Sound

Amandine Brunetto, Sascha Hornauer, Stella X. Yu, Fabien Moutarde

TL;DR

This dataset will allow research on robot echolocation, general audio-visual tasks and sound phenomena unavailable in simulated data; we also show that state-of-the-art work developed for simulated data can succeed on this dataset.

Abstract

Vision research has shown remarkable success in understanding our world, propelled by datasets of images and videos. Sensor data from radar, LiDAR and cameras has supported research in robotics and autonomous driving for at least a decade. However, while visual sensors may fail in some conditions, sound has recently shown potential to complement sensor data. Simulated room impulse responses (RIR) in 3D apartment models have become a benchmark dataset for the community, fostering a range of audio-visual research. In simulation, depth can be predicted from sound by learning bat-like perception with a neural network. Concurrently, the same was achieved in reality by using RGB-D images and echoes of chirping sounds. Biomimicking bat perception is an exciting new direction but needs dedicated datasets to explore its potential. Therefore, we collected the BatVision dataset to provide large-scale echoes in complex real-world scenes to the community. We equipped a robot with a speaker to emit chirps and a binaural microphone to record their echoes. Synchronized RGB-D images from the same perspective provide visual labels of traversed spaces. We sampled environments ranging from modern US office spaces to historic French university grounds, indoor and outdoor, with large architectural variety. This dataset will allow research on robot echolocation, general audio-visual tasks and sound phenomena unavailable in simulated data. We show promising results for audio-only depth prediction and show how state-of-the-art work developed for simulated data can also succeed on our dataset. Project page: https://amandinebtto.github.io/Batvision-Dataset/
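
As a rough illustration of why chirp echoes carry depth information: a sound reflected by a surface at distance d returns after a round-trip delay of 2d/c, where c is the speed of sound. The sketch below is not the authors' pipeline. It synthesizes a linear chirp, simulates a single delayed echo, and recovers the distance by cross-correlation. The sample rate, chirp band, chirp duration, attenuation and all function names are assumptions chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C
SAMPLE_RATE = 44_100    # Hz, assumed; not taken from the dataset


def linear_chirp(duration_s, f0, f1, sample_rate=SAMPLE_RATE):
    """Linear frequency sweep from f0 to f1 (Hz) lasting duration_s seconds."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    # Phase of a linear sweep: 2*pi * (f0*t + (f1 - f0) / (2*T) * t^2)
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / duration_s * t**2)
    return np.sin(phase)


def simulate_echo(chirp, distance_m, n_samples, attenuation=0.3,
                  sample_rate=SAMPLE_RATE):
    """Idealized recording: one attenuated copy of the chirp, delayed by 2d/c."""
    delay = int(round(2 * distance_m / SPEED_OF_SOUND * sample_rate))
    if delay + len(chirp) > n_samples:
        raise ValueError("recording too short for this distance")
    recording = np.zeros(n_samples)
    recording[delay:delay + len(chirp)] += attenuation * chirp
    return recording


def estimate_distance(recording, chirp, sample_rate=SAMPLE_RATE):
    """Find the echo delay by cross-correlation and convert it to meters."""
    corr = np.correlate(recording, chirp, mode="valid")
    delay = int(np.argmax(np.abs(corr)))
    return delay / sample_rate * SPEED_OF_SOUND / 2


chirp = linear_chirp(duration_s=0.003, f0=20.0, f1=20_000.0)
recording = simulate_echo(chirp, distance_m=2.0, n_samples=4096)
estimated = estimate_distance(recording, chirp)
```

A real scene produces many overlapping reflections at both microphones, which is why the work summarized here learns depth from echoes with a neural network rather than reading off a single correlation peak.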

Paper Structure (8 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Example scenes from BV1 (left two columns) and BV2 (right four columns). BV1 contains typical office scenes with many corridors and some open spaces. BV2 columns show a wide variety of corridors, with and without carpet, maintenance areas, antique conference rooms and outdoor scenes.
  • Figure 2: Recording robots used for BV1 at UC Berkeley (left) and BV2 at Mines Paris (right). Both record binaural audio and RGB-D images, yet the hardware setup differs.
  • Figure 3: Histogram of average depth per instance. The BV2 depth distribution is more long-tailed than that of BV1, which is consistent with the variety of its data.
  • Figure 4: Average depth per pixel. In BV1, corridors are often centered; in BV2, the depth distribution is more complex.
  • Figure 5: Test set results when training Beyond Image to Depth on BatVision V1 (a) and BatVision V2 (b), depth in meters. With the same hyperparameters as in simulation, the general layout and obstacles are clearly visible. The fifth row shows free space between two desks recovered from audio alone. Fine structures such as cables are still hard to reconstruct.
  • ...and 2 more figures
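
The two statistics named in Figures 3 and 4 (average depth per instance and average depth per pixel) could be computed along the following lines. This is a minimal sketch. It assumes the depth maps are already loaded as an array of shape (N, H, W) in meters and that invalid pixels are stored as 0. Both the layout and the invalid-value convention are assumptions, not the dataset's documented format, and `depth_statistics` is a hypothetical helper.

```python
import numpy as np


def depth_statistics(depth_maps, invalid_value=0.0):
    """Return (per-instance mean depth, per-pixel mean depth).

    depth_maps: array-like of shape (N, H, W), depth in meters (assumed layout).
    Pixels equal to invalid_value are ignored in both averages.
    """
    depth = np.asarray(depth_maps, dtype=np.float64)
    # Mask invalid readings so they do not bias the means toward zero.
    masked = np.where(depth == invalid_value, np.nan, depth)
    per_instance = np.nanmean(masked, axis=(1, 2))  # shape (N,), cf. Figure 3
    per_pixel = np.nanmean(masked, axis=0)          # shape (H, W), cf. Figure 4
    return per_instance, per_pixel


# Tiny synthetic example: two 2x2 depth maps, one with an invalid pixel.
toy_maps = np.array([[[1.0, 2.0], [3.0, 0.0]],
                     [[2.0, 2.0], [2.0, 2.0]]])
per_instance, per_pixel = depth_statistics(toy_maps)
```

A histogram of `per_instance` (for example with `np.histogram`) would give a Figure 3 style view, while `per_pixel` can be displayed directly as an image. A pixel that is invalid in every map yields NaN and a NumPy runtime warning.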