
COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan, Yangxin Xu, Fei Wang

Abstract

Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.

Paper Structure

This paper contains 29 sections, 17 equations, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Comparison of fusion input layouts under different missing-modality strategies. (a) All modalities present: the fusion head receives the full token set. (b) Skip: absent modalities are dropped, changing the fusion input structure. (c) Imputation: a single observed modality generates a proxy token for the missing slot. (d) COMPASS: all observed modalities contribute directed proxy tokens, which are aggregated into each missing slot, preserving the fixed $N$-token fusion interface.
  • Figure 2: The COMPASS framework, shown with modality $v_j$ missing at inference. Each observed modality $v_i$ passes through its encoder $E_i$ and projection $P_i$, producing a token sequence $\mathbf{z}_i \in \mathbb{R}^{L \times d}$ in a shared latent space; mean-pooling then gives a global token $\bar{\mathbf{z}}_i \in \mathbb{R}^{d}$. When $v_j$ is absent, each observed source feeds $\mathbf{z}_i$ into a directed generator $G_{i \to j}$ (a single-layer Transformer with a learnable proxy query), and the per-source outputs are mean-averaged ($\oplus$) into one proxy token $\hat{\mathbf{t}}_j$ for the missing slot. Every slot receives exactly one token---a real global token or an aggregated proxy---so the fusion input is always $N$ tokens. These are summed ($\Sigma$), layer-normalized, and fed to the task head for prediction $\hat{\mathbf{y}}$. Dashed lines mark training-only signals: $\mathcal{L}_{\mathrm{align}}$ pulls each proxy toward the real target $\bar{\mathbf{z}}_j$ (available via synthetic masking), while $\mathcal{L}_{\mathrm{ss}}$ (VICReg) regularizes pairs of observed global tokens.
  • Figure 3: Representation geometry on XRF55. (a) X-Fi embeddings colored by modality: three modalities remain separated, indicating weak shared-space alignment. (b) COMPASS embeddings: modality boundaries are substantially reduced, enabling reliable cross-modal proxy transfer. (c) Quantitative comparison: each point is one class; x-axis is cross-modal compactness (cosine similarity between same-class centroids from different modalities), y-axis is inter-class separation (minimum cosine distance to other class centroids). COMPASS improves both alignment and separation.
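The fixed N-slot fusion interface described in the abstract and Figure 2 can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the directed generators $G_{i \to j}$ are single-layer Transformers with a learnable proxy query in COMPASS, but plain linear maps stand in here, and the encoders/projections are replaced by random token sequences already in the shared latent space. All shapes and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, d = 3, 8, 16  # number of modality slots, sequence length, shared latent dim

# Observed token sequences in the shared latent space (encoders E_i and
# projections P_i are omitted); modality 1 is treated as missing.
z = {0: rng.normal(size=(L, d)), 2: rng.normal(size=(L, d))}

# Directed generators G_{i->j}: linear stand-ins for the paper's
# single-layer Transformer generators with a learnable proxy query.
G = {(i, j): rng.normal(scale=d ** -0.5, size=(d, d))
     for i in range(N) for j in range(N) if i != j}

def proxy_token(j):
    """Aggregate per-source proxies for missing slot j by mean-averaging."""
    outs = [z[i].mean(axis=0) @ G[(i, j)] for i in z]  # one proxy per observed source
    return np.mean(outs, axis=0)

# Build the fixed N-slot fusion input: each slot holds exactly one token,
# either a real mean-pooled global token or an aggregated proxy.
tokens = [z[i].mean(axis=0) if i in z else proxy_token(i) for i in range(N)]

# Sum the N tokens and layer-normalize before the task head, as in Figure 2.
fused = np.sum(tokens, axis=0)
fused = (fused - fused.mean()) / (fused.std() + 1e-6)
print(fused.shape)  # (16,)
```

The key property being illustrated is that `tokens` always has length `N` regardless of which modalities are observed, so the fusion head never sees an input structure different from training.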