Multimodal sensor fusion in the latent representation space
Robert J. Piechocki, Xiaoyang Wang, Mohammud J. Bocus
TL;DR
The paper addresses robust multimodal sensor fusion under incomplete observations with a two-stage approach. First, a self-supervised Multimodal Variational Autoencoder (MVAE) is trained to approximate the joint distribution $p(z,x_{1:M})$ of the latent cause $z$ and the $M$ modalities $x_{1:M}$. Second, the trained MVAE serves as both the reconstruction prior and the search manifold: the MAP estimate $\hat{z}_{MAP}$ of the latent cause is computed from the subsampled observations $y_m$ by running SGD through the differentiable samplers $\chi_m$ and decoders $\psi_m$ with loss $\mathcal{L}(z)=\lambda_0\|z\|^2+\sum_m\lambda_m\|y_m-\chi_m(\psi_m(z))\|^2$. Missing modalities are supported via PoE posterior approximations, and the method enables both multisensory classification and reconstruction under compressed sensing, noise, or missing data. Empirical results on passive WiFi CSI spectrograms for HAR and on synthetic toy proteins show that the approach outperforms traditional feature-level and decision-level fusion, maintains performance under noise and severe data loss, and benefits from asymmetric CS, where a strong modality assists a weak one. These findings point to latent-space fusion as a practical route to robust, label-efficient multimodal sensing in real-world settings.
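As a minimal sketch of the second-stage search, the snippet below runs plain gradient descent on the loss $\mathcal{L}(z)$ stated above. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the decoders $\psi_m$ are random linear maps standing in for trained MVAE decoders, the samplers $\chi_m$ are fixed index subsets, full-batch gradient descent with a fixed step replaces SGD, and the names `fusion_loss` and `map_latent` are hypothetical.

```python
import numpy as np

def fusion_loss(z, ys, decoders, masks, lams, lam0):
    """L(z) = lam0*||z||^2 + sum_m lam_m*||y_m - chi_m(psi_m(z))||^2."""
    total = lam0 * z @ z
    for y, W, idx, lam in zip(ys, decoders, masks, lams):
        r = y - (W @ z)[idx]          # residual after decode (W) and subsample (idx)
        total += lam * r @ r
    return total

def map_latent(ys, decoders, masks, lams, lam0, steps=2000):
    """Gradient descent on L(z), starting from z = 0.

    Assumption: each decoder is a matrix W (psi_m(z) = W @ z) and each
    sampler keeps the entries listed in idx (chi_m(x) = x[idx]).
    """
    dim_z = decoders[0].shape[1]
    # Upper bound on the largest Hessian eigenvalue; 1/bound is a safe step size.
    bound = 2 * (lam0 + sum(lam * np.linalg.norm(W[idx], 2) ** 2
                            for W, idx, lam in zip(decoders, masks, lams)))
    z = np.zeros(dim_z)
    for _ in range(steps):
        grad = 2 * lam0 * z           # gradient of the prior term
        for y, W, idx, lam in zip(ys, decoders, masks, lams):
            r = y - (W @ z)[idx]
            grad -= 2 * lam * W[idx].T @ r   # gradient of the data term
        z -= grad / bound
    return z

# Toy problem: two modalities sharing one 4-dimensional latent cause.
rng = np.random.default_rng(0)
z_true = rng.normal(size=4)
decoders = [rng.normal(size=(20, 4)), rng.normal(size=(30, 4))]
# Asymmetric subsampling: 8 of 20 entries observed in one modality, 5 of 30 in the other.
masks = [rng.choice(20, size=8, replace=False),
         rng.choice(30, size=5, replace=False)]
ys = [(W @ z_true)[idx] for W, idx in zip(decoders, masks)]
lams, lam0 = [1.0, 1.0], 1e-3

z_hat = map_latent(ys, decoders, masks, lams, lam0)
print("max abs error:", np.max(np.abs(z_hat - z_true)))
```

In this toy setting neither modality is observed in full, yet the 13 subsampled measurements together constrain the 4-dimensional latent vector. With nonlinear neural decoders the loss is no longer quadratic, so an automatic-differentiation framework and a tuned optimiser would take the place of the closed-form gradient and fixed step used here.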
Abstract
A new method for multimodal sensor fusion is introduced. The technique relies on a two-stage process. In the first stage, a multimodal generative model is constructed from unlabelled training data. In the second stage, the generative model serves as a reconstruction prior and the search manifold for the sensor fusion tasks. The method also handles cases where observations are accessed only via subsampling, i.e. compressed sensing. We demonstrate the effectiveness and excellent performance of the method on a range of multimodal fusion experiments, such as multisensory classification, denoising, and recovery from subsampled observations.
