Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs

Kenneth Ooi; Karn N. Watcharasupat; Bhan Lam; Zhen-Ting Ong; Woon-Seng Gan

TL;DR

This work addresses perception-driven soundscape augmentation by incorporating context beyond acoustics. It introduces contextual PPAP (cPPAP), a multimodal extension of the probabilistic perceptual predictor that fuses acoustic, visual, and participant-linked inputs in an attention-based DNN via one of three fusion schemes: early (ef), mid-level (mf), or late (lf). The model predicts the distribution parameters $\mu$ and $\log \sigma$ for ISO Pleasantness. On the ARAUS dataset, the best all-modality cPPAP lowers the mean squared error (MSE) to $0.1194\pm0.0012$, compared with $0.1217\pm0.0009$ for the audio-only baseline. Late fusion in the ip+ev configuration gives the best MSE, $0.1183\pm0.0011$. Trained models can also simulate the effect of participant-linked factors, which aids explainability. The results demonstrate the value of context-aware, multimodal inputs for autonomous masker selection and perceptual modeling; future work would extend to additional contextual signals and in-situ validation.
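To make the role of the two predicted parameters concrete, here is a minimal sketch of a Gaussian negative log-likelihood written in terms of $\mu$ and $\log \sigma$. The summary above does not state the training loss, so treating the prediction as a Gaussian and scoring it this way is an assumption; the function name `gaussian_nll` is hypothetical.

```python
import math


def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of y under N(mu, exp(log_sigma)**2).

    Assumption: the predicted distribution is Gaussian. Predicting
    log(sigma) instead of sigma keeps sigma positive without a constraint.
    """
    variance = math.exp(2.0 * log_sigma)
    return 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * (y - mu) ** 2 / variance


# A confident but wrong prediction is penalized more than a hedged one.
confident_wrong = gaussian_nll(y=1.0, mu=0.0, log_sigma=-2.0)
hedged_wrong = gaussian_nll(y=1.0, mu=0.0, log_sigma=0.0)
```

Under this assumed loss, a model is rewarded for reporting a larger $\sigma$ on inputs where its mean prediction is unreliable, which is one reason to predict a distribution rather than a point estimate.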

Abstract

Autonomous soundscape augmentation systems typically use trained models to pick optimal maskers to effect a desired perceptual change. While acoustic information is paramount to such systems, contextual information, including participant demographics and the visual environment, also influences acoustic perception. Hence, we propose modular modifications to an existing attention-based deep neural network, to allow early, mid-level, and late feature fusion of participant-linked, visual, and acoustic features. Ablation studies on module configurations and corresponding fusion methods using the ARAUS dataset show that contextual features improve the model performance in a statistically significant manner on the normalized ISO Pleasantness, to a mean squared error of $0.1194\pm0.0012$ for the best-performing all-modality model, against $0.1217\pm0.0009$ for the audio-only model. Soundscape augmentation systems can thereby leverage multimodal inputs for improved performance. We also investigate the impact of individual participant-linked factors using trained models to illustrate improvements in model explainability.
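The difference between early, mid-level, and late fusion is where the contextual features join the acoustic pathway. The sketch below illustrates that idea with a toy three-stage pipeline. The stage names, the `toy_layer` stand-in, and the choice of concatenation as the fusion operation are all assumptions for illustration; the abstract does not specify where in the attention-based network each fusion point sits.

```python
def concat(*vectors):
    """Concatenate feature vectors given as plain lists of floats."""
    return [x for v in vectors for x in v]


def toy_layer(x, width):
    """Hypothetical stand-in for a learned layer (not the paper's network)."""
    total = sum(x)
    return [total / (i + 1) for i in range(width)]


def forward(audio, context, fusion):
    """Run a toy pipeline; `fusion` is 'ef', 'mf', or 'lf'.

    Returns (mu, log_sigma) and the input size seen by each stage,
    so the point at which context enters is visible.
    """
    if fusion not in ("ef", "mf", "lf"):
        raise ValueError("fusion must be 'ef', 'mf', or 'lf'")
    sizes = {}
    # Early fusion: context joins the raw acoustic features.
    x = concat(audio, context) if fusion == "ef" else list(audio)
    sizes["encoder_in"] = len(x)
    x = toy_layer(x, 4)
    # Mid-level fusion: context joins an intermediate embedding.
    if fusion == "mf":
        x = concat(x, context)
    sizes["mid_in"] = len(x)
    x = toy_layer(x, 4)
    # Late fusion: context joins just before the output head.
    if fusion == "lf":
        x = concat(x, context)
    sizes["head_in"] = len(x)
    mu, log_sigma = toy_layer(x, 2)
    return (mu, log_sigma), sizes


audio = [0.1] * 6      # toy acoustic features
context = [0.5, 0.2]   # toy participant-linked or visual features
sizes_by_scheme = {f: forward(audio, context, f)[1] for f in ("ef", "mf", "lf")}
```

In this toy setup only the stage that receives the context sees a wider input, which is why the fusion schemes can be implemented as modular switches around an otherwise unchanged audio-only model.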

Paper Structure (13 sections, 5 equations, 2 figures, 1 table)

Introduction
Related Work
Proposed Method
Audio-only PPAP (aPPAP)
Contextual PPAP (cPPAP)
Early fusion (ef)
Mid-level fusion (mf)
Late fusion (lf)
Validation Experiments
Dataset
Model architecture and training
Results and Discussion
Conclusion

Figures (2)

Figure 1: Architecture of audio-only and contextual PPAP. Switches indicate the different configurations of the PPAP used for our validation experiments. Abbreviations: ip/ep = include/exclude $\boldsymbol{h}$; iv/ev = include/exclude $\boldsymbol{r}$; ef/mf/lf = early/mid-level/late fusion.
Figure 2: Mean isoPl predictions by the cPPAP (ip+iv+ef variant, seed 2) across all ARAUS dataset samples as a function of $[0,1]$-normalized PIQ items used in the Validation Experiments section. Faded vertical lines denote the mean values of the same PIQ items within the ARAUS dataset.
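A curve like those in Figure 2 can be produced by sweeping one normalized participant-linked item over $[0,1]$, keeping every other input at its recorded value, and averaging the predictions over all samples. The sketch below assumes that procedure; `sweep_item` and `toy_model` are hypothetical names, and `toy_model` is an arbitrary stand-in rather than the trained cPPAP.

```python
def sweep_item(model, samples, item_index, num_points=5):
    """Average model predictions while one participant item is varied.

    samples: list of (audio_features, participant_features) pairs, with
    participant features already normalized to [0, 1].
    """
    grid = [i / (num_points - 1) for i in range(num_points)]
    means = []
    for value in grid:
        preds = []
        for audio, participant in samples:
            modified = list(participant)   # copy so the data is not mutated
            modified[item_index] = value   # overwrite only the swept item
            preds.append(model(audio, modified))
        means.append(sum(preds) / len(preds))
    return grid, means


def toy_model(audio, participant):
    """Hypothetical predictor used only to exercise the sweep."""
    return 0.1 * sum(audio) + 0.5 * participant[0] - 0.2 * participant[1]


samples = [([1.0, 1.0], [0.3, 0.5]), ([0.0, 2.0], [0.9, 0.5])]
grid, means = sweep_item(toy_model, samples, item_index=0, num_points=3)
```

Because only the swept item changes, the slope of the resulting curve reflects how strongly the model's mean prediction depends on that item, averaged over the rest of the dataset.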
