Redundancy-Free View Alignment for Multimodal Human Activity Recognition with Arbitrarily Missing Views

Duc-Anh Nguyen, Nhien-An Le-Khac

TL;DR

This paper addresses flexible multimodal, multiview human activity recognition (HAR) when arbitrary views are missing. It introduces RALIS, which combines an adjusted center contrastive (AC) loss with view-weighted fusion and a sparse mixture-of-experts (MoE) head, so missing modalities never have to be reconstructed. The AC loss aligns per-view features toward a weighted center on a hypersphere, incorporating view quality while reducing computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. A dedicated load-balancing strategy for the MoE head addresses residual discrepancies across view combinations. RALIS is evaluated on four HAR datasets under varying view availability, where it remains robust to missing views while being more efficient, which matters for deployable multimodal HAR systems.

Abstract

Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, numbers of views, and heterogeneous modalities. Focusing on the context of human activity recognition, we propose RALIS, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of trying to reconstruct missing views, an adjusted center contrastive loss is used for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows for the integration of view weights to account for view quality. Additionally, it reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load balancing strategy, tasked with adapting to arbitrary view combinations. We highlight the geometric relationship among components in our model and how they combine well in the latent space. RALIS is validated on four datasets encompassing inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating its performance and flexibility.
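
The complexity claim can be illustrated with a small sketch. It is not the paper's loss (Figure 4 gives the authors' own pseudocode); it is a generic InfoNCE-style stand-in that contrasts each available view with the normalized, weighted center of the other available views of the same sample. The function names, the dictionary-based view layout, the temperature value, and the choice of negatives (other samples' centers for the same view slot) are all assumptions made for illustration. The point is that the weighted sum is computed once per sample and each leave-one-out center is obtained by subtraction, so the work grows linearly in $V$ rather than with the number of view pairs.

```python
import math

def normalize(vec):
    """Project a vector onto the unit hypersphere (returns a new list)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leave_one_out_centers(views, weights):
    """For each available view, return the normalized weighted center of the
    other available views of the same sample.

    views   : dict view_id -> unit vector (missing views are simply absent)
    weights : dict view_id -> non-negative weight (hypothetical quality score)
    The weighted sum is built once, so the cost is O(V) vector operations.
    """
    dim = len(next(iter(views.values())))
    total = [0.0] * dim
    for v, z in views.items():
        total = [t + weights[v] * x for t, x in zip(total, z)]
    centers = {}
    for v, z in views.items():
        # Subtract this view's own contribution instead of re-summing the rest.
        rest = [t - weights[v] * x for t, x in zip(total, z)]
        centers[v] = normalize(rest)
    return centers

def center_contrastive_loss(batch, weights, temperature=0.1):
    """Illustrative InfoNCE-style loss over a batch of samples.

    batch : list of dicts, one per sample, view_id -> unit vector.
    Samples with fewer than two available views are skipped (assumption).
    """
    centers = [leave_one_out_centers(s, weights) if len(s) >= 2 else {}
               for s in batch]
    losses = []
    for i, sample in enumerate(batch):
        for v, z in sample.items():
            if v not in centers[i]:
                continue
            logits, pos_index = [], None
            for j, c in enumerate(centers):
                if v in c:
                    if j == i:
                        pos_index = len(logits)  # own sample is the positive
                    logits.append(dot(z, c[v]) / temperature)
            # Numerically stable log-sum-exp for the softmax denominator.
            m = max(logits)
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            losses.append(log_denom - logits[pos_index])
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch a missing view is handled by leaving its key out of the sample's dictionary, so it contributes to neither the centers nor the loss, which mirrors the idea of excluding missing views from computation instead of reconstructing them.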

Paper Structure

This paper contains 32 sections, 13 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Overview of RALIS. Dashed lines indicate a missing view, which is excluded from computation. Red arrows represent stop-gradient connections, while black arrows allow gradient flow during backpropagation. During inference, the Weighted fusion block's output is the only input to the MoE block.
  • Figure 2: Illustration of fusion robustness to missing views. When views are closer together (right), the fusion shifts less upon view removal than when views are dispersed (left).
  • Figure 3: Adjusted center contrastive loss. Each view is contrasted with the other views' center on the hypersphere.
  • Figure 4: Python-style pseudocode for adjusted center contrastive loss.
  • Figure 5: Fusion of two vectors. The fused vector $z^{(wf)}$ is oriented more toward the vector with a higher weight.
  • ...and 4 more figures
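
The geometric intuition in the captions of Figures 2 and 5 can be checked numerically with a toy example. The sketch below assumes that fusion is a normalized weighted sum of unit vectors, which the captions suggest but do not spell out. The view names, the 2-D vectors, and the helper functions are hypothetical and chosen only for illustration.

```python
import math

def unit(deg):
    """2-D unit vector at the given angle in degrees."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

def weighted_fusion(views, weights):
    """Normalized weighted sum of the available views (assumed fusion rule)."""
    dim = len(next(iter(views.values())))
    fused = [0.0] * dim
    for v, z in views.items():
        fused = [f + weights[v] * x for f, x in zip(fused, z)]
    norm = math.sqrt(sum(f * f for f in fused))
    if norm == 0.0:
        raise ValueError("views cancel out; fusion direction is undefined")
    return [f / norm for f in fused]

def angle_deg(a, b):
    """Angle between two unit vectors, in degrees."""
    cos = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(cos))

def shift_after_dropping(views, weights, dropped):
    """How far the fused vector rotates when one view goes missing."""
    full = weighted_fusion(views, weights)
    remaining = {v: z for v, z in views.items() if v != dropped}
    return angle_deg(full, weighted_fusion(remaining, weights))

# Hypothetical view names with equal weights.
weights = {"imu_wrist": 1.0, "imu_waist": 1.0, "pose": 1.0}
tight = {"imu_wrist": unit(-5), "imu_waist": unit(0), "pose": unit(5)}
spread = {"imu_wrist": unit(-60), "imu_waist": unit(0), "pose": unit(60)}

print(shift_after_dropping(tight, weights, "pose"))
print(shift_after_dropping(spread, weights, "pose"))
```

With these toy inputs, dropping the `pose` view rotates the fused vector by about 2.5 degrees when the views are tightly aligned and by about 30 degrees when they are dispersed, which is the effect Figure 2 describes. Giving one view a larger weight in `weighted_fusion` pulls the fused vector toward that view, as in Figure 5.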