Multimodal sensor fusion in the latent representation space

Robert J. Piechocki, Xiaoyang Wang, Mohammud J. Bocus

TL;DR

The paper addresses robust multimodal sensor fusion under incomplete observations with a two-stage approach. First, a self-supervised Multimodal Variational Autoencoder (MVAE) is trained to approximate the joint distribution $p(z,x_{1:M})$. Second, the MVAE serves as a reconstruction prior and search manifold: the MAP latent cause $\hat{z}_{MAP}$ is computed from subsampled observations $y_m$ by SGD through differentiable samplers $\chi_m$ and decoders $\psi_m$, minimising the loss $\mathcal{L}(z)=\lambda_0\|z\|^2+\sum_m\lambda_m\|y_m-\chi_m(\psi_m(z))\|^2$. The method handles missing modalities via product-of-experts (PoE) posterior approximations and supports both multisensory classification and reconstruction under compressed sensing (CS), noise, or missing data. Empirical results on passive WiFi CSI spectrograms for human activity recognition (HAR) and on synthetic toy proteins show that the approach outperforms traditional feature- or decision-level fusion, maintains performance under noise and severe data loss, and benefits from asymmetric CS, where a strong modality assists a weak one. These findings highlight latent-space fusion as a practical route to robust, label-efficient multimodal sensing in real-world settings.
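The second stage can be illustrated with a minimal sketch. It is not the authors' implementation: the trained MVAE decoders $\psi_m$ are replaced by hypothetical random linear maps `W[m]`, the samplers $\chi_m$ by hypothetical random measurement matrices `A[m]`, and SGD by plain full-gradient descent with a hand-derived gradient. All dimensions and weights below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, M = 4, 64, 8, 2   # latent dim, signal dim, measurements per modality, modalities

# Hypothetical stand-ins (assumptions, not the paper's networks):
# linear "decoders" psi_m(z) = W[m] @ z and linear "samplers" chi_m(x) = A[m] @ x.
W = [rng.normal(size=(n, d)) for _ in range(M)]
A = [rng.normal(size=(k, n)) / np.sqrt(k) for _ in range(M)]
B = [A[m] @ W[m] for m in range(M)]        # composed map chi_m(psi_m(.))

z_true = rng.normal(size=d)
y = [B[m] @ z_true for m in range(M)]      # noise-free subsampled observations y_m

lam0, lam = 1e-3, [1.0] * M                # lambda_0 and lambda_m (assumed values)

def loss(z):
    """L(z) = lam0 * ||z||^2 + sum_m lam_m * ||y_m - chi_m(psi_m(z))||^2."""
    data = sum(lam[m] * np.sum((y[m] - B[m] @ z) ** 2) for m in range(M))
    return lam0 * np.sum(z ** 2) + data

def grad(z):
    """Analytic gradient of the loss for the linear stand-in model."""
    g = 2.0 * lam0 * z
    for m in range(M):
        g -= 2.0 * lam[m] * B[m].T @ (y[m] - B[m] @ z)
    return g

# The loss is quadratic here, so a step of 1 / (largest Hessian eigenvalue) converges.
H = 2.0 * (lam0 * np.eye(d) + sum(lam[m] * B[m].T @ B[m] for m in range(M)))
step = 1.0 / np.linalg.eigvalsh(H).max()

z_hat = np.zeros(d)                        # initial guess
loss_start = loss(z_hat)
for _ in range(2000):
    z_hat = z_hat - step * grad(z_hat)
loss_end = loss(z_hat)

x_hat = [W[m] @ z_hat for m in range(M)]   # per-modality reconstructions psi_m(z_hat)
```

With nonlinear neural decoders the loss is no longer quadratic, so the gradient would come from automatic differentiation and the fixed step size above would not carry over. The linear case is used only because its minimiser can be checked in closed form.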

Abstract

A new method for multimodal sensor fusion is introduced. The technique relies on a two-stage process. In the first stage, a multimodal generative model is constructed from unlabelled training data. In the second stage, the generative model serves as a reconstruction prior and the search manifold for the sensor fusion tasks. The method also handles cases where observations are accessed only via subsampling, i.e. compressed sensing. We demonstrate its effectiveness and excellent performance on a range of multimodal fusion experiments such as multisensory classification, denoising, and recovery from subsampled observations.

Paper Structure

This paper contains 7 sections, 12 equations, 11 figures, 9 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Multimodal Sensor Fusion: (a) Decision fusion, (b) Feature fusion, (c) Our technique: fusion in the latent representation with optional compressed sensing measurements; $F$ features, $p(z)$ prior model, $\mathbf{G}$ generators, $X$ complete data, $Y$ subsampled data. For clarity, $M=2$ modalities are shown; the concept generalises to any $M$.
  • Figure 2: (a) Generated toy protein examples ($N=64$) and (b) reconstruction from compressed sensing observations. With 2 out of 64 measurements (3.125%), near-perfect reconstruction is possible even though the modalities are individually subsampled.
  • Figure 3: Illustration of spectrogram recovery (for the sitting-down activity) using compressed sensing with as few as 784 out of 50,176 measurements (1.56%). No additive white Gaussian noise is considered. The left column shows the true spectrogram sample, the middle column shows the reconstruction from an initial guess (no optimization), and the right column shows the reconstruction with $\hat{Z}_{MAP}$.
  • Figure S1: M-VAE for a full data case: Single encoder network takes all modalities.
  • Figure S2: Opportunistic Passive WiFi Radar.
  • ...and 6 more figures