Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning

Chen Liang, Donghua Yang, Zhiyu Liang, Hongzhi Wang, Zheng Liang, Xiyang Zhang, Jianfeng Huang

TL;DR

This work tackles unsupervised representation learning for multivariate time series by addressing limitations of multi-modal feature fusion. It introduces MMFA, a model-agnostic framework that aligns latent representations across diverse time-series transforms using a graph spectral perspective, while preserving a single raw-time-series encoder for scalable inference. The approach is underpinned by theoretical results linking graph Laplacian eigenfunctions to embedding distances, invariance across modalities, and orthogonality, with recovery results to KPCA and KCCA under linear degeneracy. Empirically, MMFA demonstrates strong improvements over state-of-the-art URL methods and competitive performance against task-tailored supervised models across classification, clustering, and anomaly detection on 31 real-world datasets, with ablations illustrating the value of diverse transforms and asymmetric encoder alignment.

Abstract

Unsupervised representation learning (URL) for time series has recently attracted significant interest because of its adaptability across diverse downstream applications. Because unsupervised learning objectives differ from those of downstream tasks, it is difficult to ensure downstream utility by focusing only on temporal feature characterization. To fill this gap, researchers have proposed multiple transformations that extract discriminative patterns implicit in informative time series. Despite the introduction of a variety of feature engineering techniques (e.g., spectral-domain features, wavelet-transformed features, features in image form, and symbolic features), the use of intricate feature fusion methods and the dependence on heterogeneous features during inference hamper the scalability of these solutions. To address this, our study introduces an approach, inspired by spectral graph theory, that aligns and binds time series representations encoded from different modalities, thereby guiding the neural encoder to uncover latent pattern associations among these multi-modal features. In contrast to conventional methods that fuse features from multiple modalities, our approach simplifies the neural architecture by retaining a single time series encoder, which preserves scalability. We further demonstrate and prove mechanisms by which the encoder maintains a better inductive bias. We validate the proposed method on a diverse set of time series datasets from various domains, where it outperforms existing state-of-the-art URL methods across diverse downstream tasks.

Paper Structure

This paper contains 25 sections, 7 theorems, 33 equations, 3 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

Equivalence between Eigenvalues and Distance Reduction of Spectral Embeddings. $\mathbb{E}_{(x, x^\prime)\sim p_{\text{sim}}} [(f(x) - f(x^\prime))^2]$ denotes the expected squared difference between representations of data point pairs under distribution $p_{\text{sim}}$, and $f$ is an eigenfunction
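
As a rough illustration of the kind of relationship Theorem 1 describes, the sketch below checks the standard finite-graph identity $f^\top L f = \tfrac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2 = \lambda$ for a unit-norm eigenvector $f$ of the unnormalized Laplacian $L = D - W$. It is not the paper's exact statement: the cycle graph, the unnormalized Laplacian, and all function names are assumptions chosen so the eigenpair is known in closed form.

```python
import math

def cycle_graph_weights(n):
    """Symmetric 0/1 adjacency matrix of an n-node cycle (toy similarity graph, assumed)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        w[i][j] = 1.0
        w[j][i] = 1.0
    return w

def laplacian_apply(w, f):
    """Return L f for the unnormalized graph Laplacian L = D - W."""
    n = len(f)
    out = []
    for i in range(n):
        degree = sum(w[i])
        out.append(degree * f[i] - sum(w[i][j] * f[j] for j in range(n)))
    return out

def pairwise_energy(w, f):
    """0.5 * sum_ij w_ij (f_i - f_j)^2: weighted squared difference over connected pairs."""
    n = len(f)
    return 0.5 * sum(w[i][j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))

n, k = 8, 1
w = cycle_graph_weights(n)

# Closed-form eigenpair of the cycle Laplacian: f_i = cos(2*pi*k*i/n),
# eigenvalue 2 - 2*cos(2*pi*k/n). Dividing by sqrt(n/2) gives unit norm.
norm = math.sqrt(n / 2)
f = [math.cos(2 * math.pi * k * i / n) / norm for i in range(n)]
eigenvalue = 2 - 2 * math.cos(2 * math.pi * k / n)

# Largest deviation of L f from eigenvalue * f (should be numerically zero).
lf = laplacian_apply(w, f)
residual = max(abs(lf[i] - eigenvalue * f[i]) for i in range(n))

# For a unit-norm eigenvector, the pairwise energy equals the eigenvalue.
energy = pairwise_energy(w, f)
print(residual, energy, eigenvalue)
```

In this toy setting, a smaller eigenvalue means the eigenvector varies less across connected (similar) pairs, which is the intuition behind reading eigenvalues as a measure of distance reduction between representations of similar points.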

Figures (3)

  • Figure 1: An overview of the MMFA framework. A range of transforms is applied to raw time series to produce informative features across multiple modalities. These features are subsequently processed by neural feature extractors to identify different patterns. Following this, the representations are mapped to an embedding space and aligned via the regularization method.
  • Figure 2: Two types of patterns of interest shown in three multi-modal feature views (raw time series, CWT, and FFT of the time series). A green dashed circle marks patterns that are hard to distinguish, and a red dashed circle marks patterns that are easy to identify. The feature encoders take these multi-modal features as input. How difficult a given pattern is to capture varies with the transform, so the encoders assign different probabilities that two sample points are identical.
  • Figure 3: Correlation plot of the Cohen's d effect sizes of MMFA's performance improvement on 30 UEA datasets against characteristics of the datasets, i.e., training size, time series dimensionality, and time length. Logarithmic transforms are applied to make the relationships more linear.
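
For readers unfamiliar with the effect size used in Figure 3, Cohen's d can be sketched as the difference in means divided by the pooled standard deviation. The snippet below is a minimal illustration under that common definition; the function name and the accuracy values are made-up toy numbers, not results from the paper, and the paper may compute the statistic over different samples.

```python
import math
import statistics

def cohens_d(treatment, baseline):
    """Cohen's d with the pooled sample standard deviation (common textbook form)."""
    n1, n2 = len(treatment), len(baseline)
    v1 = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(baseline)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(baseline)) / pooled_sd

# Hypothetical per-run accuracies on one dataset (toy values for illustration only).
with_alignment = [0.82, 0.85, 0.84, 0.86, 0.83]
without_alignment = [0.80, 0.81, 0.79, 0.82, 0.78]

d = cohens_d(with_alignment, without_alignment)
print(f"Cohen's d = {d:.3f}")
```

A positive d indicates that the first group scores higher on average, scaled by how much the runs vary.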

Theorems & Definitions (17)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Remark 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • proof
  • proof
  • proof
  • ...and 7 more