STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data

Maximilian Forstenhäusler, Daniel Külzer, Christos Anagnostopoulos, Shameem Puthiya Parambath, Natascha Weber

TL;DR

STaRFormer tackles the challenge of modeling non-stationary and irregularly sampled sequential data by introducing dynamic regional masking and a task-informed semi-supervised contrastive learning scheme. The method uses a Siamese encoder-Transformer architecture to learn robust latent representations that are aligned both batch-wise and class-wise, while jointly optimizing the downstream task (classification, anomaly detection, or regression). Experiments across 56 datasets and multiple downstream tasks show strong performance gains, especially under irregular sampling and non-stationarity, along with improved robustness and latent-space separability. The main limitation is training-time overhead from the masking and the dual-view contrastive learning; inference remains efficient, and the framework applies broadly across time-series domains.

Abstract

Understanding user intent is essential for situational and context-aware decision-making. Motivated by a real-world scenario, this work addresses intent predictions of smart device users in the vicinity of vehicles by modeling sequential spatiotemporal data. However, in real-world scenarios, environmental factors and sensor limitations can result in non-stationary and irregularly sampled data, posing significant challenges. To address these issues, we propose STaRFormer, a Transformer-based approach that can serve as a universal framework for sequential modeling. STaRFormer utilizes a new dynamic attention-based regional masking scheme combined with a novel semi-supervised contrastive learning paradigm to enhance task-specific latent representations. Comprehensive experiments on 56 datasets varying in types (including non-stationary and irregularly sampled), tasks, domains, sequence lengths, training samples, and applications demonstrate the efficacy of STaRFormer, achieving notable improvements over state-of-the-art approaches.
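
As a rough illustration of what attention-based regional masking can look like, the sketch below masks fixed-width windows centred on the positions with the highest attention scores. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names `regional_mask` and `apply_mask`, the parameters `num_regions` and `radius`, the zero fill value, and the choice to target the highest-attention positions are all hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def regional_mask(attention: Sequence[float],
                  num_regions: int = 2,
                  radius: int = 1) -> List[bool]:
    """Return a boolean mask (True = masked) covering windows centred on
    the `num_regions` positions with the highest attention scores.

    Assumption: one attention score per sequence position is available.
    """
    n = len(attention)
    # Rank positions by score, highest first; ties go to the earlier index.
    ranked = sorted(range(n), key=lambda i: (-attention[i], i))
    mask = [False] * n
    for centre in ranked[:num_regions]:
        # Clip the window to the sequence boundaries.
        lo = max(0, centre - radius)
        hi = min(n, centre + radius + 1)
        for i in range(lo, hi):
            mask[i] = True
    return mask


def apply_mask(sequence: Sequence[float],
               mask: Sequence[bool],
               fill: float = 0.0) -> List[float]:
    """Replace masked elements with `fill` (a hypothetical choice of mask value)."""
    return [fill if m else x for x, m in zip(sequence, mask)]


# Toy attention scores for a sequence of length 8 (illustrative values only).
scores = [0.05, 0.10, 0.40, 0.05, 0.05, 0.05, 0.25, 0.05]
mask = regional_mask(scores, num_regions=2, radius=1)
masked = apply_mask([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], mask)
```

In a training loop, scores of this kind would presumably be recomputed from the encoder's attention weights at each step, which is what would make the masking dynamic; that, too, is an assumption here.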

Paper Structure

This paper contains 46 sections, 33 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Architecture of STaRFormer. (a) High-level Siamese network architecture: the left tower performs the downstream task while the right tower performs the reconstruction of the masked sequence. (b) The scheme exemplified by a single batch from the dataset with batch size 16 for an encoder with $N=4$ layers. ReM abbreviates regional mask.
  • Figure 2: t-SNE visualizations (plotted with perplexity 50) of latent-space representations for four datasets, shown in panels (a, b), (c, d), (e, f), and (g, h; the PS dataset), comparing Base and STaRFormer.
  • Figure 3: Example plots visualizing the non-stationary characteristics of the sequential data in the dataset. The red or orange line visualizes the mean and the green dashed lines the standard deviation of a segment. Multiple mean and standard deviation lines per plot indicate changes in the underlying generative distribution of the visualized data. These plots only serve as a demonstration and visualization of the non-stationary characteristics of data samples from the dataset.
  • Figure 4: An illustration demonstrating the collection and utilization of signal measurements to localize a smart device around the vehicle using a $\mathrm{multiplier}_{\mathrm{RAN}}=3$. Due to the continuous hopping strategy for ranging, fixed ranging round indices are set before the data is collected. In this case, ranging round indices [9, 2, 6] lead to a difference in the time delta between three consecutive $\mathrm{R}_\mathrm{R}$'s, e.g., $|\Delta_t \mathrm{R}_{\mathrm{R}}^1 - \Delta_t \mathrm{R}_{\mathrm{R}}^2| = 264\,\mathrm{ms}$ [ccc_digital_2024-1].
  • Figure 5: Illustration of (a) a sequence using a first-order Markov chain and (b) a sequence using a Markov chain of latent variables.
  • ...and 6 more figures