Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation

Renjie Wu, Hu Wang, Feras Dayoub, Hsiang-Ting Chen

TL;DR

This work introduces Segment Beyond View (SBV), a framework for audio-visual semantic segmentation under partially missing modalities, specifically to identify out-of-view vehicles for pedestrian safety. SBV employs a teacher-student distillation scheme (Omni2Ego) with a vision teacher on panoramas and an 8-channel auditory teacher to guide an ego-centric student that processes first-person view and binaural audio, using AVFFM fusion and reconstruction-based auxiliary tasks. The model optimizes a combined loss of feature alignment, logits distillation, and modality reconstruction, and is evaluated on the Omni Auditory Perception Dataset, showing superior performance to state-of-the-art baselines and robustness to FoV and audio channel variations. The results suggest practical impact for AR safety, robot navigation, and autonomous driving by enabling reliable detection of hazards beyond the immediate visual field.

Abstract

Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have a limited field-of-view (FoV) with front or downward perspectives. To address this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses the information beyond the FoV, with auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings.
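The three loss terms named in the TL;DR (feature alignment, logits distillation, and modality reconstruction) can be illustrated with a minimal sketch. This is not the authors' implementation: the mean-squared-error form of the alignment and reconstruction terms, the temperature-scaled KL divergence for the logits term, the weights `w_feat`, `w_logit`, `w_rec`, and the function name `combined_loss` are all assumptions chosen for illustration, and flat Python lists stand in for feature maps and per-pixel logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a flat list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def combined_loss(student_feat, teacher_feat,
                  student_logits, teacher_logits,
                  reconstructed, target,
                  w_feat=1.0, w_logit=1.0, w_rec=1.0, temperature=2.0):
    """Weighted sum of three terms (all forms and weights are assumptions)."""
    # Feature alignment: pull student features toward teacher features.
    l_feat = mse(student_feat, teacher_feat)
    # Logits distillation: match softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    l_logit = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Modality reconstruction: auxiliary task on the missing input content.
    l_rec = mse(reconstructed, target)
    return w_feat * l_feat + w_logit * l_logit + w_rec * l_rec
```

Under these assumptions the loss is zero when the student reproduces the teacher's features and logits and the reconstruction matches its target, and it grows as any of the three diverges. In a real training setup a supervised segmentation loss would normally be added as well; the summary above does not say how the terms are weighted.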

Paper Structure (20 sections, 7 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: The pedestrian in the image can see only objects within the field of view (FoV), but can hear the oncoming out-of-view vehicle and determine its general location and type. The right image illustrates our novel task: given only the FoV and binaural audio, the model semantically segments both in-view and out-of-view vehicles in the panorama.
  • Figure 2: Binaural audio selected according to the left and right rotation of the head. Left turns are counted as negative and right turns as positive.
  • Figure 3: The Segment Beyond View training architecture consists of a vision teacher, an auditory teacher, and an audio-visual student. The student takes the first-person view and binaural audio as inputs. The vision teacher takes panoramas as input, and the auditory teacher takes 8-channel audio. Enc: Encoder; Dec: Decoder; Seg: Segmentation Head; Rec: Reconstruction.
  • Figure 4: mIoU (%) results for different field of view sizes.
  • Figure 5: Background, input, ground truth, and segmentation results of TPAVI (Zhou et al., 2023) and ours (SBV) under different weather conditions. Light areas outlined in green are the first-person views. Yellow boxes mark the out-of-view differences between segmentation results.
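
Figure 4 reports results as mIoU (%). As a reminder of what that metric measures, here is a minimal sketch of mean intersection-over-union over flattened label maps. The function name `mean_iou` is hypothetical, and skipping classes that appear in neither prediction nor ground truth is one common convention that is assumed here; the paper's exact evaluation protocol (for example, per-image versus dataset-level averaging) is not stated in this summary.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU in percent over two equal-length lists of class labels."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return 100.0 * sum(ious) / len(ious) if ious else 0.0


# Four-pixel example with two classes:
# class 0 has IoU 1/2, class 1 has IoU 2/3, so the mean is about 58.33%.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```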