X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

Xinyan Chen, Jianfei Yang

TL;DR

X-Fi presents a modality-invariant foundation model for multimodal human sensing that can accommodate arbitrary and changing sensor modalities without retraining. It combines modality-specific feature encoders with an X-Fusion block, consisting of a cross-modal transformer and modality-specific cross-attention modules, to learn a unified cross-modal embedding while preserving distinct modality information. Training uses a modality-existence scheme that randomly activates subsets of modalities, enabling robust performance across diverse modality configurations. On MM-Fi and XRF55, X-Fi achieves state-of-the-art results for Human Pose Estimation and Human Activity Recognition, demonstrating strong generalization and practicality for scalable, real-world multimodal sensing systems.
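
The modality-existence scheme mentioned above can be pictured as sampling a random, non-empty subset of modalities at each training step and feeding only those to the model. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the modality names, the `keep_prob` parameter, and the helper names `sample_active_modalities` and `mask_batch` are all hypothetical.

```python
import random

# Illustrative modality names only; the source does not list them here.
MODALITIES = ["rgb", "depth", "lidar", "mmwave", "wifi"]


def sample_active_modalities(modalities, keep_prob=0.5, rng=random):
    """Return a random, non-empty subset of the given modalities.

    Each modality is kept independently with probability keep_prob
    (an assumed hyperparameter). If nothing is kept, one modality is
    drawn at random so the model always receives some input.
    """
    active = [m for m in modalities if rng.random() < keep_prob]
    if not active:
        active = [rng.choice(modalities)]
    return active


def mask_batch(batch, active):
    """Keep only the entries of a {modality: data} batch that are active."""
    return {name: data for name, data in batch.items() if name in active}


if __name__ == "__main__":
    rng = random.Random(0)
    batch = {name: f"<{name} features>" for name in MODALITIES}
    for step in range(3):
        active = sample_active_modalities(MODALITIES, rng=rng)
        print(step, sorted(mask_batch(batch, active)))
```

Sampling a fresh subset at every step means that, over training, the model sees many different modality combinations, which is the property the TL;DR credits for robustness across configurations.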

Abstract

Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.
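
To see why a transformer-style structure can accommodate a variable number of inputs, consider generic single-query scaled dot-product attention: the output has a fixed size no matter how many tokens are attended over. The sketch below shows only that general property in plain Python. It is not the X-Fusion block, and the function name `attend` and the toy vectors are assumptions made for illustration.

```python
import math


def attend(query, keys, values):
    """Single-query scaled dot-product attention over any number of tokens.

    query:  list of floats, length d
    keys:   list of n key vectors, each of length d
    values: list of n value vectors, each of length d_v
    Returns one vector of length d_v, independent of n.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over the n scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    d_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    one_token = attend(query, [[1.0, 0.0]], [[2.0, 3.0]])
    three_tokens = attend(
        query,
        [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
        [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]],
    )
    # Both outputs have the same length even though the token counts differ.
    print(len(one_token), len(three_tokens))
```

Because the softmax normalizes over however many tokens are present, adding or removing a modality changes the number of keys and values but not the shape of the result, which is what lets a single model be reused across modality combinations.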

Paper Structure

This paper contains 23 sections, 3 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: The left image depicts current human sensing solutions that are specifically designed and trained for fixed modality combinations, while the right image illustrates our proposed modality-invariant foundation model, X-Fi, which can be trained once and adapted to various scenarios.
  • Figure 2: The architecture of the proposed modality-invariant foundation model, X-Fi. X-Fi consists of modality feature encoders and an X-Fusion module, which includes a cross-modal transformer and modality-specific cross-attention modules. The modalities with dotted lines represent inactive modalities in the given scenario. The $N$ in the X-Fusion block represents the number of iterations.
  • Figure 3: Comparison of predicted human skeletons.
  • Figure 4: Comparison of multi-modal embedding distribution for HAR. The upper right corner of the image provides an enlarged view of the red-boxed area from the original image.
  • Figure 5: A detailed comparison of the RGB HPE result, the depth HPE result, and the ground truth.
  • ...and 3 more figures