MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling

Diwei Huang, Kunyang Lin, Peihao Chen, Qing Du, Mingkui Tan

TL;DR

MAGIC predicts room impulse responses (RIRs) at novel locations from only a few observations by constructing acoustic-related semantic maps that encode spatial and material cues. It builds an observation semantic map from pixel-wise visual features, then generates a scene semantic map with a feature anticipation module. A transformer-based encoder-decoder fuses the map features with echo information to predict RIRs for arbitrary speaker-listener pairs. The model is trained with STFT and energy decay matching losses and outperforms prior state-of-the-art methods on the Matterport3D and Replica datasets, including in unseen environments. The work advances data-efficient, spatially aware acoustic modeling with potential impact on AR/VR realism and spatial audio applications, and it notes memory and real-world deployment considerations.
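As a rough illustration of the two training objectives named above, the following NumPy sketch computes an STFT magnitude loss and an energy decay loss between a predicted and a target RIR. This section does not give the paper's exact formulation, so everything here is an assumption: the function names are hypothetical, the losses are plain L1 distances, the window and hop sizes are arbitrary, and the decay curve uses time-domain Schroeder backward integration, a common definition that may differ from what MAGIC uses.

```python
import numpy as np

def stft_magnitude(signal, n_fft=256, hop=64):
    """Magnitude STFT from Hann-windowed frames (assumed window/hop sizes)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)

def stft_loss(pred, target, n_fft=256, hop=64):
    """L1 distance between magnitude spectrograms (assumed form of the loss)."""
    diff = stft_magnitude(pred, n_fft, hop) - stft_magnitude(target, n_fft, hop)
    return float(np.mean(np.abs(diff)))

def energy_decay_curve_db(rir, eps=1e-12):
    """Schroeder backward integration of the squared RIR, in dB relative to total energy."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]  # energy remaining from each sample onward
    return 10.0 * np.log10(energy / (energy[0] + eps) + eps)

def energy_decay_loss(pred, target):
    """L1 distance between the two decay curves (assumed form of the loss)."""
    diff = energy_decay_curve_db(pred) - energy_decay_curve_db(target)
    return float(np.mean(np.abs(diff)))

# Synthetic example: exponentially decaying noise as a stand-in for real RIRs.
rng = np.random.default_rng(0)
t = np.arange(4000)
noise = rng.standard_normal(4000)
target_rir = noise * np.exp(-t / 500.0)   # slower decay (more reverberant)
pred_rir = noise * np.exp(-t / 200.0)     # faster decay
total = stft_loss(pred_rir, target_rir) + energy_decay_loss(pred_rir, target_rir)
```

Because the decay curve is normalized by total energy, this version of the energy decay loss ignores overall gain and responds only to how quickly the energy falls off. The STFT term is what penalizes level and spectral mismatches.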

Abstract

Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response (RIR) at arbitrary locations from few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a *map-guided* framework that constructs acoustic-related visual semantic feature maps of the scenes. Visual features preserve semantic details related to sound, and maps provide explicit structural regularities of sound propagation; both are valuable for modeling environment acoustics. We thus extract pixel-wise semantic features from the observations and project them into a top-down map, namely the **observation semantic map**. This map contains the relative positional information among points and the semantic feature information associated with each point. Yet the limited information that few-shot observations place on the map is not sufficient for understanding and modeling the whole scene. We address this challenge by generating a **scene semantic map**, diffusing the features of the observation semantic map and anticipating those of unobserved regions. The scene semantic map then interacts with the echo encoding through a transformer-based encoder-decoder to predict the RIR for arbitrary speaker-listener query pairs. Extensive experiments on the Matterport3D and Replica datasets verify the efficacy of our framework.
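The projection of pixel-wise features into a top-down map can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes a pinhole camera at the map origin looking along the forward axis, a metric depth image aligned with the feature image, and mean pooling of the features that fall into the same cell. The function name `project_to_topdown`, the map size, and the cell size are hypothetical. A full system would also transform points by the agent's pose before binning them.

```python
import numpy as np

def project_to_topdown(features, depth, fx, cx, map_size=32, cell_m=0.25):
    """Scatter pixel-wise features into a top-down grid (hypothetical helper).

    features: (H, W, C) per-pixel feature vectors
    depth:    (H, W) metric depth along the optical axis; 0 marks invalid pixels
    fx, cx:   horizontal focal length and principal point, in pixels
    Returns a (map_size, map_size, C) feature map and a boolean mask of observed cells.
    """
    H, W, C = features.shape
    u = np.tile(np.arange(W), (H, 1))            # column index of every pixel
    x = (u - cx) * depth / fx                    # lateral offset in meters
    z = depth                                    # forward distance in meters

    col = np.floor(x / cell_m).astype(int) + map_size // 2  # camera sits at the center column
    row = np.floor(z / cell_m).astype(int)                   # row 0 is the camera position
    valid = (depth > 0) & (col >= 0) & (col < map_size) & (row >= 0) & (row < map_size)

    flat_idx = row[valid] * map_size + col[valid]
    sums = np.zeros((map_size * map_size, C))
    counts = np.zeros(map_size * map_size)
    np.add.at(sums, flat_idx, features[valid])   # accumulate features per cell
    np.add.at(counts, flat_idx, 1.0)             # count pixels per cell

    observed = counts > 0
    sums[observed] /= counts[observed, None]     # mean-pool within each observed cell
    return sums.reshape(map_size, map_size, C), observed.reshape(map_size, map_size)

# Toy example: a 4x4 image of a flat wall 1 m ahead; each column carries its index as a feature.
feats = np.zeros((4, 4, 2))
feats[:, :, :] = np.arange(4)[None, :, None]
depth = np.ones((4, 4))
grid, mask = project_to_topdown(feats, depth, fx=2.0, cx=2.0, map_size=8, cell_m=0.5)
```

In the toy example, only the four cells that the wall projects onto are marked as observed. The remaining cells stay empty, which is the gap the scene semantic map is meant to fill.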

Paper Structure (27 sections, 12 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration of few-shot audio-visual acoustics modeling and the distinction between existing methods and our method. Existing methods extract visual features directly from an image-wise encoder and feed them to the RIR prediction module. In contrast, our method builds a feature map that captures pixel-wise semantic context related to sound production to help acoustic modeling.
  • Figure 2: General scheme of MAGIC. MAGIC uses a U-Net pre-trained on semantic segmentation to extract acoustic-related semantic (ARS) features and projects the pixel-wise features onto the observation semantic map. The observation semantic map is then fed into the feature anticipation module to anticipate the features of unseen areas. The resulting scene semantic map interacts with the echo encoding through a transformer-based encoder-decoder to predict the RIR for arbitrary speaker-listener query pairs. We train the model by minimizing the loss between the predicted RIR and the target RIR.
  • Figure 3: STFT error vs. context size. Comparison of MAGIC (Ours) and Few-ShotRIR on the unseen split.
  • Figure 4: Qualitative RIR prediction. The left half is the top-down view of the scene. The right half shows the predicted RIR of MAGIC (Ours), ground truth, and Few-ShotRIR.
  • Figure 5: Architecture of the feature anticipation module.