MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling
Diwei Huang, Kunyang Lin, Peihao Chen, Qing Du, Mingkui Tan
TL;DR
MAGIC addresses the challenge of predicting room impulse responses (RIRs) at novel locations from only a few observations by constructing acoustic-related semantic maps that encode spatial and material cues. It builds an observation semantic map from pixel-wise visual features and a scene semantic map generated by a feature anticipation module. Both feed a transformer-based encoder-decoder that fuses map features with echo information to predict RIRs for arbitrary speaker-listener pairs. The model is trained with STFT and energy decay matching losses and outperforms state-of-the-art methods on the Matterport3D and Replica datasets, including strong generalization to unseen environments. The work advances data-efficient, spatially aware acoustic modeling with potential impact on AR/VR realism and spatial audio applications, while noting memory and real-world deployment considerations.
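The two training objectives named above can be illustrated with a minimal NumPy sketch. The exact formulation used by MAGIC is not given here, so this assumes a common form: an L1 distance between STFT magnitudes, and an L1 distance between energy decay curves obtained by Schroeder backward integration. The function names, window size, hop length, and synthetic RIRs are all hypothetical choices made for illustration.

```python
import numpy as np

def stft_magnitude(x, n_fft=256, hop=64):
    """Magnitude STFT with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def energy_decay_curve_db(rir, eps=1e-10):
    """Schroeder backward integration of the squared RIR, in dB re. total energy."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / (energy[0] + eps) + eps)

def stft_loss(pred, target):
    """Assumed form: mean absolute difference of STFT magnitudes."""
    return float(np.mean(np.abs(stft_magnitude(pred) - stft_magnitude(target))))

def energy_decay_loss(pred, target):
    """Assumed form: mean absolute difference of decay curves in dB."""
    return float(
        np.mean(np.abs(energy_decay_curve_db(pred) - energy_decay_curve_db(target)))
    )

# Synthetic RIRs: the same noise with a slow and a fast exponential decay.
rng = np.random.default_rng(0)
t = np.arange(4000) / 16000.0
noise = rng.standard_normal(4000)
rir_slow = noise * np.exp(-30.0 * t)
rir_fast = noise * np.exp(-90.0 * t)

edc = energy_decay_curve_db(rir_slow)
loss_same = stft_loss(rir_slow, rir_slow) + energy_decay_loss(rir_slow, rir_slow)
loss_diff = stft_loss(rir_fast, rir_slow) + energy_decay_loss(rir_fast, rir_slow)
```

The STFT term compares time-frequency detail, while the decay term compares how quickly energy dies away, which is what reverberation time measures. Identical signals give zero for both terms.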
Abstract
Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response (RIR) at arbitrary locations from few-shot observations. To fully exploit the provided few-shot data for accurate acoustic modeling, we present a *map-guided* framework that constructs acoustic-related visual semantic feature maps of the scenes. Visual features preserve semantic details related to sound, and maps provide explicit structural regularities of sound propagation; both are valuable for modeling environment acoustics. We thus extract pixel-wise semantic features from the observations and project them into a top-down map, namely the **observation semantic map**. This map contains the relative positional information among points and the semantic feature information associated with each point. Yet the limited information extracted from few-shot observations on the map is not sufficient for understanding and modeling the whole scene. We address this challenge by generating a **scene semantic map** through diffusing and anticipating features on the observation semantic map. The scene semantic map then interacts with the echo encoding through a transformer-based encoder-decoder to predict the RIR for arbitrary speaker-listener query pairs. Extensive experiments on the Matterport3D and Replica datasets verify the efficacy of our framework.
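The projection of pixel-wise features into a top-down map can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: it assumes a depth image, a pinhole camera with hypothetical intrinsics `fx` and `cx`, an egocentric grid with the camera at the middle of the first row, and averaging of the features that fall into the same cell. The name `project_to_topdown` and all parameter values are invented for the example.

```python
import numpy as np

def project_to_topdown(depth, feats, fx, cx, map_size=8, cell=0.5):
    """Average per-pixel features into a top-down grid (rows: forward z, cols: lateral x).

    depth: (H, W) distances in metres; feats: (H, W, C) per-pixel features.
    Returns the (map_size, map_size, C) map and a boolean mask of observed cells.
    """
    H, W = depth.shape
    C = feats.shape[-1]
    u = np.tile(np.arange(W), (H, 1))           # pixel column index
    x = (u - cx) * depth / fx                   # lateral offset (pinhole model)
    z = depth                                   # forward distance
    col = np.floor(x / cell).astype(int) + map_size // 2
    row = np.floor(z / cell).astype(int)
    valid = (
        (depth > 0)
        & (col >= 0) & (col < map_size)
        & (row >= 0) & (row < map_size)
    )
    flat = (row * map_size + col)[valid]        # flattened cell index per pixel
    sums = np.zeros((map_size * map_size, C))
    counts = np.zeros(map_size * map_size)
    np.add.at(sums, flat, feats[valid])         # accumulate features per cell
    np.add.at(counts, flat, 1)
    mean = sums / np.maximum(counts, 1)[:, None]
    observed = counts.reshape(map_size, map_size) > 0
    return mean.reshape(map_size, map_size, C), observed

# Toy observation: a flat wall 2 m ahead, seen by a 4x4 image.
depth = np.full((4, 4), 2.0)
feats = np.zeros((4, 4, 2))
feats[..., 0] = np.arange(4)[None, :]           # feature 0 encodes the pixel column
feats[..., 1] = 1.0
sem_map, observed = project_to_topdown(depth, feats, fx=2.0, cx=2.0)
```

Only the cells hit by some pixel carry features; the rest stay empty. Those empty cells are the gap that the scene semantic map is meant to fill by anticipating features for unobserved regions.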
