AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness

Yizhuo Yang, Shenghai Yuan, Muqing Cao, Jianfei Yang, Lihua Xie

TL;DR

AV-PedAware addresses reliable 3D pedestrian awareness for robotics with low-cost sensors by proposing a self-supervised audio-visual fusion framework. It uses a teacher-student architecture: a pretrained 3D detector produces pseudo-labels that supervise an audio-visual network, which fuses image features with multi-microphone audio through an attention mechanism and is further trained with an auxiliary segmentation task. The three main contributions are a cross-modal self-supervised network for 3D pedestrian detection, an attention-based multimodal fusion strategy, and a new multimodal dataset that combines point cloud, RGB, and audio data. Experiments show that AV-PedAware achieves comparable or superior 3D detection performance to LIDAR-based systems at a fraction of the cost and remains robust in dark conditions, which makes it practical for affordable robotic perception. The authors also commit to releasing the dataset and code to support further research in multimodal robotic perception.
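The two ideas in this summary, attention-weighted fusion of modalities and supervision by teacher pseudo-labels, can be illustrated with a minimal sketch. It is not the paper's network: the single `query` vector, the function names, and the L1 loss are assumptions chosen only to make the mechanism concrete, and real features would come from learned image and audio encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(image_feat, audio_feat, query):
    """Fuse two equal-length feature vectors with attention weights.

    `query` is a hypothetical learned vector (an assumption of this sketch).
    Each modality is scored by its scaled dot product with the query, the
    scores are normalised with softmax, and the features are mixed accordingly.
    """
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in (image_feat, audio_feat)
    ]
    w_img, w_aud = softmax(scores)
    fused = [w_img * i + w_aud * a for i, a in zip(image_feat, audio_feat)]
    return fused, (w_img, w_aud)


def pseudo_label_loss(student_box, teacher_box):
    """Mean absolute error between the student's predicted box parameters
    and the teacher's pseudo-label (L1 is an assumed choice of loss)."""
    return sum(abs(s - t) for s, t in zip(student_box, teacher_box)) / len(teacher_box)
```

With an all-zero image feature (a stand-in for a dark frame) and a non-zero audio feature that aligns with the query, the softmax assigns most of the weight to audio, which is the intuition behind the robustness claim. During training, `pseudo_label_loss` would compare the student's output against the box produced by the LIDAR-based teacher; at inference time only the audio and image inputs are needed.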

Abstract

In this study, we introduce AV-PedAware, a self-supervised audio-visual fusion system designed to improve dynamic pedestrian awareness for robotics applications. Pedestrian awareness is a critical requirement in many robotics applications. However, traditional approaches that rely on cameras and LIDARs to cover multiple views can be expensive and susceptible to issues such as changes in illumination, occlusion, and weather conditions. Our proposed solution replicates human perception for 3D pedestrian detection using low-cost audio and visual fusion. This study represents the first attempt to employ audio-visual fusion to monitor footstep sounds for the purpose of predicting the movements of pedestrians in the vicinity. The system is trained through self-supervised learning based on LIDAR-generated labels, making it a cost-effective alternative to LIDAR-based pedestrian awareness. AV-PedAware achieves comparable results to LIDAR-based systems at a fraction of the cost. By utilizing an attention mechanism, it can handle dynamic lighting and occlusions, overcoming the limitations of traditional LIDAR and camera-based systems. To evaluate our approach's effectiveness, we collected a new multimodal pedestrian detection dataset and conducted experiments that demonstrate the system's ability to provide reliable 3D detection results using only audio and visual data, even in extreme visual conditions. We will make our collected dataset and source code available online for the community to encourage further development in the field of robotics perception systems.

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our proposed method fully exploits the complementary information between audio and visual modalities, enabling 360-degree pedestrian detection.
  • Figure 2: The audio-visual pedestrian detection network adopts a teacher-student architecture. A well-trained PointRCNN generates 3D bounding boxes from point cloud data to supervise the learning of the audio-visual network. During inference, the network obtains the 3D bounding box of the pedestrian using only audio and image data.
  • Figure 3: The structure of Audionet. Two types of convolutional kernels are used to extract time and frequency features from the input spectrogram.
  • Figure 4: The multi-modality data collection suite used in our experiment.
  • Figure 5: Visualization results of the proposed method. The first row presents the detection results when the pedestrian is in the camera's line of sight, while the third row shows the results in a dark environment and when the pedestrian is out of view. The red bounding box is the ground truth, and the green box is the predicted result.
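The Figure 3 caption mentions two kinds of convolutional kernels, one for time features and one for frequency features. The sketch below shows what that distinction means on a spectrogram stored as `spec[freq][time]`: one kernel slides along the time axis within each frequency bin, the other slides along the frequency axis within each time frame. The kernel sizes, the function names, and the use of plain cross-correlation without padding are assumptions for illustration; the actual Audionet layer shapes are not given in this section.

```python
def conv_time(spec, kernel):
    """Slide a 1-D kernel along the time axis of each frequency bin.

    `spec` is a list of rows indexed as spec[freq][time]. Uses valid
    cross-correlation, so the time dimension shrinks by len(kernel) - 1.
    """
    k = len(kernel)
    return [
        [sum(row[t + j] * kernel[j] for j in range(k)) for t in range(len(row) - k + 1)]
        for row in spec
    ]


def conv_freq(spec, kernel):
    """Slide a 1-D kernel along the frequency axis of each time frame.

    Uses valid cross-correlation, so the frequency dimension shrinks
    by len(kernel) - 1.
    """
    k = len(kernel)
    n_freq, n_time = len(spec), len(spec[0])
    return [
        [sum(spec[f + j][t] * kernel[j] for j in range(k)) for t in range(n_time)]
        for f in range(n_freq - k + 1)
    ]


# Toy spectrogram: 3 frequency bins by 4 time frames.
spec = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
diff = [1, -1]  # simple difference kernel, chosen only for the example

time_features = conv_time(spec, diff)   # 3 x 3: change between adjacent frames
freq_features = conv_freq(spec, diff)   # 2 x 4: change between adjacent bins
```

A time-oriented kernel responds to how energy in a band changes from frame to frame (for example, the onset of a footstep), while a frequency-oriented kernel responds to the spectral shape within a single frame. In a trained network both kinds of kernel would be learned rather than fixed.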