Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification

Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, Ievgen Redko

TL;DR

Mantis is a lightweight, open-source foundation model for time series classification, built on the Vision Transformer (ViT) and pre-trained with a contrastive objective on a large, diverse unlabeled dataset. Its architecture has three parts: a novel token generator that produces 32 tokens from raw, differential, and statistical patches; a 6-layer ViT backbone with a class token; and a projector/prediction head used for pre-training and fine-tuning. The authors also propose adapters that compress the channels of multivariate series, which makes inference and fine-tuning efficient on data with many channels. Empirically, Mantis achieves higher accuracy and better calibration than state-of-the-art time series foundation models in both the zero-shot and fine-tuning regimes, and the paper offers practical guidance on adapters and calibration techniques for real-world deployments.
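To make the "raw, differential, and statistical patches" idea concrete, here is a minimal sketch that splits a univariate series into 32 patches and collects the three views per patch. It is illustrative only: the function name `patch_features`, the non-overlapping equal-size patching, the zero-padded first difference, and the choice of mean and standard deviation as the statistics are all assumptions, and the learned embedding layers that would turn these views into actual tokens are omitted.

```python
from statistics import fmean, pstdev


def patch_features(series, n_tokens=32):
    """Split a series into n_tokens patches and return three views per patch.

    Hypothetical helper, not the Mantis implementation: it only gathers the
    inputs (raw values, first differences, summary statistics) that a learned
    token generator could embed.
    """
    if not series or len(series) % n_tokens != 0:
        raise ValueError("series length must be a positive multiple of n_tokens")

    # First difference, padded with a leading 0.0 so it keeps the same length.
    diffs = [0.0] + [b - a for a, b in zip(series, series[1:])]

    size = len(series) // n_tokens
    patches = []
    for k in range(n_tokens):
        lo, hi = k * size, (k + 1) * size
        raw = series[lo:hi]
        patches.append(
            {
                "raw": raw,                          # raw values of the patch
                "diff": diffs[lo:hi],                # differential view
                "stats": (fmean(raw), pstdev(raw)),  # assumed statistics
            }
        )
    return patches


if __name__ == "__main__":
    demo = [float(i) for i in range(64)]  # toy series, two points per patch
    out = patch_features(demo)
    print(len(out), out[0])
```

Each of the 32 entries would correspond to one token position; in a full model, the three views would be embedded and combined before entering the ViT backbone.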

Abstract

In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.
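The abstract reports calibration error without defining it here. One common way to measure it is the expected calibration error (ECE) with equal-width confidence bins; the sketch below assumes that definition, which may differ from the exact metric or binning used in the paper. The function name and the default of 10 bins are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between accuracy and mean confidence per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     1 (or True) if the prediction was right, else 0 (or False).
    Bins are [k/n_bins, (k+1)/n_bins), with 1.0 placed in the last bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Two confident, correct predictions and two less confident ones, half right.
    print(expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0]))
```

A lower value means the predicted confidences track the observed accuracy more closely, which is the sense in which a model can be called better calibrated.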

Paper Structure

This paper contains 25 sections, 5 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Architecture. By symbol $+$ we denote the sum operator, while $||$ designates the vector concatenation operator.
  • Figure 2: RandomCropResize
  • Figure 3: Zero-shot feature extraction: comparison with the SOTA. The accuracy is averaged over 3 random seeds and over 159-D datasets.
  • Figure 4: The running time of MOMENT and Mantis on the UEA-27 collection of datasets. For better visualization, we sort and split the datasets into two clusters based on the reported running time.
  • Figure 5: Model fine-tuning: comparison with the SOTA. The accuracy is averaged over 3 random seeds and over 131-D datasets.
  • ...and 6 more figures