From Orthogonality to Dependency: Learning Disentangled Representation for Multi-Modal Time-Series Sensing Signals

Ruichu Cai, Zhifang Jiang, Zijian Li, Weilin Chen, Xuexin Chen, Zhifeng Hao, Yifan Shen, Guangyi Chen, Kun Zhang

TL;DR

The paper addresses disentangling dependent modality-shared and modality-specific latent factors in multi-modal time-series data. It introduces MATE, a temporally variational framework with shared and private priors and a modality-shared constraint to achieve disentanglement under non-orthogonal latent spaces. The authors establish subspace and component-wise identifiability results rooted in nonlinear ICA concepts and demonstrate superior performance across diverse datasets and tasks. This approach offers a principled, robust pathway for real-world multi-modal time-series analysis with practical implications for IoT, healthcare, and beyond.

Abstract

Existing methods for multi-modal time series representation learning aim to disentangle the modality-shared and modality-specific latent variables. Although they achieve notable performance on downstream tasks, they usually assume an orthogonal latent space. However, the modality-specific and modality-shared latent variables might be dependent in real-world scenarios. Therefore, we propose a general generation process, where the modality-shared and modality-specific latent variables are dependent, and further develop a Multi-modAl TEmporal Disentanglement (MATE) model. Specifically, our MATE model is built on a temporally variational inference architecture with modality-shared and modality-specific prior networks for the disentanglement of latent variables. Furthermore, we establish identifiability results to show that the extracted representation is disentangled. More specifically, we first achieve subspace identifiability for the modality-shared and modality-specific latent variables by leveraging the pairing of multi-modal data. Then we establish component-wise identifiability of the modality-specific latent variables by employing sufficient changes of historical latent variables. Extensive experimental studies on multi-modal sensors, human activity recognition, and healthcare datasets show a general improvement in different downstream tasks, highlighting the effectiveness of our method in real-world scenarios.
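
To make the contrast between an orthogonal and a dependent latent space concrete, the following is a minimal toy sketch in Python. It is not the paper's generation process or the MATE model: the function name `simulate_dependent_latents`, the autoregressive coefficients, the `dep` coupling parameter, and the `tanh` mixing are all assumptions chosen for illustration. It simulates one modality-shared latent series and two modality-specific latent series whose dynamics depend on the shared one, then mixes them into two observed modalities.

```python
import numpy as np


def simulate_dependent_latents(T=5000, dep=0.8, seed=0):
    """Toy two-modality time series (illustrative assumption, not the paper's model).

    dep controls how strongly the modality-specific latents depend on the
    modality-shared latent; dep=0 gives independent latent series.
    """
    rng = np.random.default_rng(seed)
    z_c = np.zeros(T)   # modality-shared latent
    z_s1 = np.zeros(T)  # modality-specific latent, modality 1
    z_s2 = np.zeros(T)  # modality-specific latent, modality 2
    for t in range(1, T):
        # Each latent depends on its own history (temporal dynamics).
        z_c[t] = 0.9 * z_c[t - 1] + rng.normal(scale=0.5)
        # The specific latents additionally depend on the shared latent.
        z_s1[t] = 0.5 * z_s1[t - 1] + dep * z_c[t] + rng.normal(scale=0.5)
        z_s2[t] = 0.5 * z_s2[t - 1] - dep * z_c[t] + rng.normal(scale=0.5)
    # Each modality observes a nonlinear mixture of the shared latent
    # and its own specific latent (hypothetical mixing functions).
    x1 = np.tanh(z_c + 0.5 * z_s1) + 0.1 * z_s1
    x2 = np.tanh(z_c - 0.5 * z_s2) + 0.1 * z_s2
    return z_c, z_s1, z_s2, x1, x2


if __name__ == "__main__":
    for dep in (0.0, 0.8):
        z_c, z_s1, _, _, _ = simulate_dependent_latents(dep=dep)
        corr = np.corrcoef(z_c, z_s1)[0, 1]
        print(f"dep={dep}: corr(z_c, z_s1) = {corr:.3f}")
```

With `dep=0` the shared and specific latents are close to uncorrelated, which is the situation an orthogonality assumption matches. With a nonzero `dep` they are clearly correlated, which is the setting the abstract argues is common in real-world data.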

Paper Structure

This paper contains 35 sections, 4 theorems, 50 equations, 5 figures, 11 tables.

Key Result

Theorem 1

(Subspace Identification of the Modality-shared and Modality-specific Latent Variables) Suppose that the observed data from different modalities is generated following the data generation process in Figure 2, and we further make the following assumptions: Then if $\hat{g}_1:\mathcal{Z}_t^c\times\mathcal{Z}_t^{s_1}\rightarrow \mathcal{X}_t^{s_1}$ and $\hat{g}_2:\mathcal{Z}_t^c\times\m

Figures (5)

  • Figure 1: Illustration of physiological indicators of diabetics, where brain-related and heart-related signals are observations. (a) In the true generation process, observations are generated from dependent latent sources. (b) In the estimation process, enforcing orthogonality on estimated sources can result in the entanglement of latent sources and meaningless noise.
  • Figure 2: Data generation process of time series data with two modalities. The grey and white nodes denote the observed and latent variables.
  • Figure 3: Illustration of the proposed MATE model. We consider two modalities for ease of understanding; the model can be easily extended to more modalities. The modality-specific encoders are used to extract the latent variables of different modalities. The specific prior networks and the shared prior network are used to estimate the prior distribution for the KL divergence.
  • Figure A1: Ablation study on the DINAMO and the Motion datasets.
  • Figure A2: The t-SNE visualization of the extracted domain-shared latent variables.

Theorems & Definitions (6)

  • Theorem 1
  • Corollary 1.1
  • Theorem A1
  • proof
  • Corollary A1
  • proof