From Orthogonality to Dependency: Learning Disentangled Representation for Multi-Modal Time-Series Sensing Signals

Ruichu Cai; Zhifang Jiang; Zijian Li; Weilin Chen; Xuexin Chen; Zhifeng Hao; Yifan Shen; Guangyi Chen; Kun Zhang

TL;DR

The paper addresses disentangling dependent modality-shared and modality-specific latent factors in multi-modal time-series data. It introduces MATE, a temporally variational framework with shared and private priors and a modality-shared constraint to achieve disentanglement under non-orthogonal latent spaces. The authors establish subspace and component-wise identifiability results rooted in nonlinear ICA concepts and demonstrate superior performance across diverse datasets and tasks. This approach offers a principled, robust pathway for real-world multi-modal time-series analysis with practical implications for IoT, healthcare, and beyond.

Abstract

Existing methods for multi-modal time series representation learning aim to disentangle the modality-shared and modality-specific latent variables. Although these methods achieve notable performance on downstream tasks, they usually assume an orthogonal latent space. However, the modality-specific and modality-shared latent variables might be dependent in real-world scenarios. Therefore, we propose a general generation process, where the modality-shared and modality-specific latent variables are dependent, and further develop a Multi-modAl TEmporal Disentanglement (MATE) model. Specifically, our MATE model is built on a temporally variational inference architecture with modality-shared and modality-specific prior networks for the disentanglement of latent variables. Furthermore, we establish identifiability results to show that the extracted representation is disentangled. More specifically, we first achieve subspace identifiability for the modality-shared and modality-specific latent variables by leveraging the pairing of multi-modal data. Then we establish component-wise identifiability of the modality-specific latent variables by employing sufficient changes of historical latent variables. Extensive experimental studies on multi-modal sensor, human activity recognition, and healthcare datasets show a general improvement in different downstream tasks, highlighting the effectiveness of our method in real-world scenarios.
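To make the idea of dependent modality-shared and modality-specific latent variables concrete, the following is a minimal toy sketch, not the authors' generation process or code. The function name `simulate`, the AR(1) dynamics, the `coupling` parameter, and the mixing functions are all assumptions chosen for illustration. It shows that when a modality-specific latent is driven by the shared latent, the two are strongly correlated, so an orthogonality assumption would not match the data.

```python
import numpy as np


def simulate(T=5000, seed=0, coupling=0.8):
    """Toy two-modality time series with dependent latents (illustrative only).

    z_c  : modality-shared latent
    z_s1 : latent specific to modality 1
    z_s2 : latent specific to modality 2
    `coupling` (hypothetical parameter) controls how strongly the
    modality-specific latents depend on the shared one.
    """
    rng = np.random.default_rng(seed)
    z_c = np.zeros(T)
    z_s1 = np.zeros(T)
    z_s2 = np.zeros(T)
    for t in range(1, T):
        # Shared latent: depends only on its own history.
        z_c[t] = 0.5 * z_c[t - 1] + rng.normal()
        # Specific latents: depend on their own history AND the shared latent.
        z_s1[t] = 0.5 * z_s1[t - 1] + coupling * z_c[t] + rng.normal()
        z_s2[t] = 0.5 * z_s2[t - 1] - coupling * z_c[t] + rng.normal()
    # Each modality observes an invertible nonlinear mixture of (z_c, z_s_m).
    x1 = np.stack([np.tanh(z_c + z_s1), 0.5 * z_c - z_s1], axis=1)
    x2 = np.stack([np.tanh(z_c - z_s2), 0.5 * z_c + z_s2], axis=1)
    return {"z_c": z_c, "z_s1": z_s1, "z_s2": z_s2, "x1": x1, "x2": x2}


if __name__ == "__main__":
    dep = simulate(coupling=0.8)
    ind = simulate(coupling=0.0)
    print("corr(z_c, z_s1), dependent:  ",
          np.corrcoef(dep["z_c"], dep["z_s1"])[0, 1])
    print("corr(z_c, z_s1), independent:",
          np.corrcoef(ind["z_c"], ind["z_s1"])[0, 1])
```

Setting `coupling=0.0` recovers the independent case that orthogonality-based methods implicitly assume; any nonzero value produces the dependent setting the abstract describes.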

Paper Structure (35 sections, 4 theorems, 50 equations, 5 figures, 11 tables)

Introduction
Problem Setup
Data Generation Process of Multi-modal Time Series
Problem Definition
MATE: Multi-modal Temporal Disentanglement Model
Variational-Inference-based Neural Architecture
Specific and Shared Prior Networks
Model Summary
Theoretical Analysis
Subspace Identifiability and Component-wise Identifiability
Subspace Identifiability of Latent Variables
Component-wise Identifiability of Modality-shared Latent Variables
Relationships between Identifiability and Representation Learning
Experiments
Experiment Setup
...and 20 more sections

Key Result

Theorem 1

(Subspace Identification of the Modality-shared and Modality-specific Latent Variables) Suppose that the observed data from different modalities is generated following the data generation process in Figure 2, and we further make the following assumptions: Then if $\hat{g}_1:\mathcal{Z}_t^c\times\mathcal{Z}_t^{s_1}\rightarrow \mathcal{X}_t^{s_1}$ and $\hat{g}_2:\mathcal{Z}_t^c\times\m

Figures (5)

Figure 1: Illustration of physiological indicators of diabetics, where brain-related and heart-related signals are observations. (a) In the true generation process, observations are generated from dependent latent sources. (b) In the estimation process, enforcing orthogonality on estimated sources can result in the entanglement of latent sources and meaningless noises.
Figure 2: Data generation process of time series data with two modalities. The grey and white nodes denote the observed and latent variables.
Figure 3: Illustration of the proposed MATE model. We consider two modalities for ease of understanding; the model extends readily to more modalities. The modality-specific encoders are used to extract the latent variables of different modalities. The specific prior networks and the shared prior network are used to estimate the prior distribution for the KL divergence.
Figure A1: Ablation study on the DINAMO and the Motion datasets.
Figure A2: The t-SNE visualization of the extracted domain-shared latent variables.

Theorems & Definitions (6)

Theorem 1
Corollary 1.1
Theorem A1
proof
Corollary A1
proof
