Joint Embeddings Go Temporal
Sofiane Ennadir, Siavash Golkar, Leopoldo Sarra
TL;DR
This work addresses the sensitivity of time-series self-supervised learning to noise and confounding factors by introducing TS-JEPA, a latent-space predictive architecture based on JEPA. TS-JEPA uses a Tokenizer, an Encoder, a Predictor, and an EMA-Encoder to predict the latent representations of masked inputs, and is trained with the latent-space loss $$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \| z'_i - t_i \|_1,$$ where $z'_{\mathcal{M}} = P_{\beta}(E_{\theta}(\mathcal{P}_{\mathcal{N}}))_{\mathcal{M}}$ is the Predictor output at the masked positions $\mathcal{M}$. Empirically, TS-JEPA achieves strong classification performance, matching or surpassing MAE, and remains competitive in forecasting while offering improved stability and sample efficiency. These results suggest that latent-space JEPA can provide balanced, transferable representations across classification and forecasting, which motivates scaling it toward time-series foundation models.
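As a rough illustration of the latent-space loss in the TL;DR, the NumPy sketch below computes the mean L1 distance between predicted and target latents over the masked positions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the array shapes, the reading of $t_i$ as targets produced by the EMA-Encoder, and the EMA rule with its `decay` value are all hypothetical choices made for illustration.

```python
import numpy as np


def masked_l1_latent_loss(pred, target, masked_idx):
    """Mean over masked tokens of the L1 distance between latents.

    pred, target: arrays of shape (num_tokens, dim); `target` is assumed
        to hold the EMA-Encoder outputs t_i (an assumption).
    masked_idx: 1-D integer array with the masked positions (the set M).
    """
    diff = np.abs(pred[masked_idx] - target[masked_idx])  # shape (|M|, dim)
    # Sum over the latent dimension gives ||z'_i - t_i||_1 per token,
    # then average over the |M| masked tokens.
    return diff.sum(axis=1).mean()


def ema_update(target_w, online_w, decay=0.99):
    """Standard exponential moving average of weights.

    The decay value is an assumption, not a setting taken from the paper.
    """
    return decay * target_w + (1.0 - decay) * online_w


# Toy usage: 4 tokens with 3-dimensional latents, tokens 1 and 3 masked.
pred = np.zeros((4, 3))
target = np.ones((4, 3))
masked_idx = np.array([1, 3])
loss = masked_l1_latent_loss(pred, target, masked_idx)
```

In this toy case every masked token differs from its target by 1 in each of 3 dimensions, so the loss is 3.0. Because the targets live in latent space rather than input space, the loss never asks the model to reproduce raw input values.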
Abstract
Self-supervised learning has seen great success recently in unsupervised representation learning, enabling breakthroughs in natural language and image processing. However, these methods often rely on autoregressive and masked modeling, which aim to reproduce masked information in the input and can therefore be vulnerable to noise or confounding variables. To address this problem, Joint-Embedding Predictive Architectures (JEPA) have been introduced to perform self-supervised learning in the latent space. To bring these advances to time series, we introduce Time Series JEPA (TS-JEPA), an architecture specifically adapted for time series representation learning. We validate TS-JEPA on both classification and forecasting, showing that it can match or surpass current state-of-the-art baselines on several standard datasets. Notably, our approach demonstrates a strong performance balance across diverse tasks, indicating its potential as a robust foundation for learning general representations. This work thus lays the groundwork for future time series foundation models based on Joint Embedding.
