Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning

Chen Liang; Donghua Yang; Zhiyu Liang; Hongzhi Wang; Zheng Liang; Xiyang Zhang; Jianfeng Huang

TL;DR

This work tackles unsupervised representation learning (URL) for multivariate time series by addressing limitations of multi-modal feature fusion. It introduces MMFA, a model-agnostic framework that aligns latent representations across diverse time-series transforms from a graph spectral perspective, while keeping a single raw-time-series encoder for scalable inference. The approach is supported by theoretical results that link graph Laplacian eigenfunctions to embedding distances, invariance across modalities, and orthogonality, and that recover KPCA and KCCA under linear degeneracy. Empirically, MMFA shows strong improvements over state-of-the-art URL methods and performance competitive with task-tailored supervised models on classification, clustering, and anomaly detection across 31 real-world datasets; ablations illustrate the value of diverse transforms and asymmetric encoder alignment.

Abstract

Unsupervised representation learning (URL) for time series has recently attracted significant interest because of its adaptability across diverse downstream applications. Because unsupervised learning objectives differ from downstream tasks, focusing only on temporal feature characterization makes it difficult to ensure downstream utility. To fill this gap, researchers have proposed multiple transformations that extract discriminative patterns implied in informative time series. A variety of feature engineering techniques have been introduced, e.g., spectral-domain features, wavelet-transformed features, features in image form, and symbolic features. However, the use of intricate feature fusion methods and the dependence on heterogeneous features during inference hamper the scalability of these solutions. To address this, our study introduces an approach, inspired by spectral graph theory, that aligns and binds time series representations encoded from different modalities, thereby guiding the neural encoder to uncover latent pattern associations among these multi-modal features. In contrast to conventional methods that fuse features from multiple modalities, our approach simplifies the neural architecture by retaining a single time series encoder, which preserves scalability. We further demonstrate and prove mechanisms by which the encoder maintains a better inductive bias. In our experimental evaluation, we validate the proposed method on a diverse set of time series datasets from various domains. Our approach outperforms existing state-of-the-art URL methods across diverse downstream tasks.
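
As a rough illustration of what a "spectral domain" view of a time series can look like, the sketch below computes a one-sided FFT magnitude for each channel of a toy multivariate series. The function name `fft_magnitude_view`, the toy signal, and the choice of magnitude features are assumptions made for illustration; the abstract does not specify how the transforms are implemented.

```python
import numpy as np

def fft_magnitude_view(x: np.ndarray) -> np.ndarray:
    """Return the one-sided FFT magnitude of each channel.

    x has shape (channels, length); the result has shape
    (channels, length // 2 + 1). Hypothetical helper, not from the paper.
    """
    return np.abs(np.fft.rfft(x, axis=-1))

# Toy multivariate series: 2 channels, 64 time steps (illustrative data only).
t = np.arange(64)
series = np.stack([
    np.sin(2 * np.pi * 4 * t / 64),   # 4 cycles over the window
    np.cos(2 * np.pi * 10 * t / 64),  # 10 cycles over the window
])

spectral_view = fft_magnitude_view(series)
# Index of the strongest frequency bin per channel.
dominant_bins = spectral_view.argmax(axis=-1)
print(spectral_view.shape, dominant_bins)
```

In a setup like the one the abstract describes, a view of this kind would presumably be used only during training, with the single raw-time-series encoder used alone at inference.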

Paper Structure (25 sections, 7 theorems, 33 equations, 3 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Representation Learning for Time Series
Feature Transforms
Preliminaries
Unsupervised Representation Learning for Multivariate Time Series
Multi-modal Features and Neural Encoders
Neural Encoder
Overview
Regularization Based Multi-Modal Feature Alignment Algorithm
Training Objective Approximating GLE
Asymmetric Encoders Alignment Optimizing Algorithm
Empirical Analysis
Experiment Details
Experimental Settings
...and 10 more sections

Key Result

Theorem 1

Equivalence between Eigenvalues and Distance Reduction of Spectral Embeddings. $\mathbb{E}_{(x, x^\prime)\sim p_{\text{sim}}} [(f(x) - f(x^\prime))^2]$ denotes the expected squared difference between representations of data point pairs under distribution $p_{\text{sim}}$, and $f$ is an eigenfunction
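
The theorem statement above is cut off in this summary, so the following is only a finite-graph analogue and not the paper's result. It numerically checks a standard identity for the unnormalized graph Laplacian $L = D - W$: for a unit-norm eigenvector $f$ with eigenvalue $\lambda$, the expected squared difference under the pair distribution $p_{\text{sim}}(i, j) = w_{ij} / \sum_{kl} w_{kl}$ equals $2\lambda / \sum_{kl} w_{kl}$. The random weight matrix and this particular choice of $p_{\text{sim}}$ are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random symmetric similarity weights with a zero diagonal (toy graph).
a = rng.random((n, n))
w = (a + a.T) / 2
np.fill_diagonal(w, 0.0)

# Unnormalized graph Laplacian L = D - W and its eigen-decomposition.
lap = np.diag(w.sum(axis=1)) - w
eigvals, eigvecs = np.linalg.eigh(lap)  # columns are unit-norm eigenvectors

# Pair distribution proportional to the edge weights (assumed form of p_sim).
p_sim = w / w.sum()

def expected_sq_diff(f: np.ndarray) -> float:
    """E_{(i,j) ~ p_sim}[(f_i - f_j)^2] for a vector f over the nodes."""
    diff = f[:, None] - f[None, :]
    return float((p_sim * diff ** 2).sum())

gaps = np.array([expected_sq_diff(eigvecs[:, k]) for k in range(n)])
predicted = 2.0 * eigvals / w.sum()
print(np.allclose(gaps, predicted))
```

Under these assumptions, eigenvectors with smaller eigenvalues place similar points closer together, which is the intuition the theorem title points to.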

Figures (3)

Figure 1: An overview of the MMFA framework. A range of transforms is applied to raw time series to produce informative features across multiple modalities. These features are subsequently processed by neural feature extractors to identify different patterns. Following this, the representations are mapped to an embedding space and aligned via the regularization method.
Figure 2: Two types of patterns of interest shown in three multi-modal feature views (raw time series, CWT, and FFT of the time series). Green dashed circles mark patterns that are hard to distinguish; red dashed circles mark patterns that are easy to identify. The feature encoders take these multi-modal features as input. How hard a given pattern is to capture varies with the transform, so the encoders assign different probabilities to two sample points being identical.
Figure 3: Correlation plot illustrating the relationships between Cohen's d effect sizes of performance improvement made by MMFA on 30 UEA datasets and characteristics of the datasets, i.e., training size, time series dimensionality, and time length. Logarithmic transforms are employed to enhance the linearity of the data.
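
For readers unfamiliar with the effect size named in the Figure 3 caption, the sketch below computes Cohen's d with a pooled standard deviation. The caption does not say which variant of Cohen's d is used, so the pooled form is an assumption, and the two score lists are made-up numbers, not results from the paper.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy scores for two hypothetical methods (illustrative only).
scores_a = [2.0, 4.0, 6.0]
scores_b = [1.0, 3.0, 5.0]

d = cohens_d(scores_a, scores_b)
print(d)  # 0.5: means differ by 1, pooled standard deviation is 2
```

A per-dataset value like this could then be correlated with log-transformed dataset characteristics such as training size, which is the kind of relationship the caption describes.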

Theorems & Definitions (17)

Definition 1
Definition 2
Theorem 1
Remark 1
Theorem 2
Theorem 3
Theorem 4
proof
proof
proof
...and 7 more
