Finding Foundation Models for Time Series Classification with a PreText Task

Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier

TL;DR

This work addresses overfitting in Time Series Classification (TSC) by introducing pre-trained domain foundation models. The strategy reduces overfitting on small datasets and provides an efficient route for adapting these models to new datasets.

Abstract

Over the past decade, Time Series Classification (TSC) has gained increasing attention. While various methods have been explored, deep learning, particularly through Convolutional Neural Networks (CNNs), stands out as an effective approach. However, due to the limited availability of training data, defining a foundation model for TSC that overcomes the overfitting problem is still a challenging task. The UCR archive, encompassing a wide spectrum of datasets ranging from motion recognition to ECG-based heart disease detection, serves as a prime example for exploring this issue in diverse TSC scenarios. In this paper, we address the overfitting challenge by introducing pre-trained domain foundation models. A key aspect of our methodology is a novel pretext task that spans multiple datasets. This task is designed to identify the originating dataset of each time series sample, with the goal of creating flexible convolution filters that can be applied across different datasets. The research process consists of two phases: a pre-training phase where the model acquires general features through the pretext task, and a subsequent fine-tuning phase for specific dataset classifications. Our extensive experiments on the UCR archive demonstrate that this pre-training strategy significantly outperforms the conventional training approach without pre-training. This strategy effectively reduces overfitting in small datasets and provides an efficient route for adapting these models to new datasets, thus advancing the capabilities of deep learning in TSC.
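The pretext task described in the abstract (predicting which dataset each series comes from) amounts to relabeling the pooled archive. The following is a minimal sketch of that relabeling step only, assuming each dataset is available as a NumPy array of shape `(n_samples, series_length)`; the function name `build_pretext_labels` and the toy archive are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def build_pretext_labels(datasets):
    """Pool several datasets and label each series with its dataset index.

    datasets: list of arrays, one per dataset, each of shape
    (n_samples, series_length). Lengths may differ across datasets,
    so the pooled series are returned as a list rather than one array.
    """
    series, labels = [], []
    for dataset_id, X in enumerate(datasets):
        for x in X:
            series.append(np.asarray(x, dtype=float))
            # The original class labels are ignored: the pretext target
            # is the index of the originating dataset.
            labels.append(dataset_id)
    return series, np.array(labels)

# Toy archive (hypothetical): two datasets with different series lengths.
rng = np.random.default_rng(0)
toy_archive = [rng.normal(size=(5, 30)), rng.normal(size=(3, 50))]
pretext_series, pretext_labels = build_pretext_labels(toy_archive)
```

A classifier trained on `pretext_series` against `pretext_labels` would then play the role of the pre-trained model. The fine-tuning phase on each dataset's own labels is not shown here.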

Paper Structure (22 sections, 8 figures, 2 tables, 2 algorithms)

Figures (8)

  • Figure 1: Summary of the proposed pretext task approach. Given an archive of $N$ datasets, the first step is to train a model (in blue), referred to as the pre-trained model, on all of the datasets, where the classification task is to predict the dataset each time series belongs to. The second step is to copy the pre-trained model and follow it with a randomly initialized add-on model (in green). The second step is done for each of the $N$ datasets of the archive independently. After constructing the $N$ new models, each one is fine-tuned on its own dataset according to that dataset's classification task.
  • Figure 2: The architecture of H-Inception divided into two sub-models. The first model is the pre-trained model, trained on the pretext task (dotted green rectangle), while the second model is the randomly initialized add-on model (dotted red rectangle). The H-Inception model is made of six Inception modules, where each module contains three convolution layers (in orange) and a MaxPooling layer (in magenta) followed by a concatenation (in yellow), a batch normalization layer (in oily) and an activation function (in red). Each Inception module, except the first one, is preceded by a bottleneck layer (in purple) to reduce the dimensionality and hence the number of parameters. The first Inception module contains the hybrid addition, which is the hand-crafted convolution filter (in green). Residual connections exist between the input and the third module, as well as between the third module and the output (in cyan).
  • Figure 3: An example using the proposed Batch Normalizing Multiplexer (BNM), which solves the problem of learning a batch normalization layer on multiple samples of different distributions (datasets). The BNM is made of multiple batch normalization layers (in oily with blue and red contours) preceded by a multiplexer. This multiplexer has three different nodes: (a) the input node, where the input time series goes through, (b) the control node, where the information about the dataset this input time series belongs to goes through, and (c) the output node. The path selected for the output node is controlled by node (b). It is important to note that the BNM, like the traditional batch normalization layer, learns on the whole batch. The only difference is that more than one batch normalization layer will be fed by parts of this batch, which intuitively means the flow of information is slower when using the BNM.
  • Figure 4: A 1v1 scatter plot that compares the performance of H-InceptionTime (baseline) and PHIT in terms of accuracy. Each point represents a dataset, where the $x$ and $y$ axes represent the accuracy of H-InceptionTime and PHIT, respectively. A blue point represents a win for PHIT, an orange point a win for H-InceptionTime and a green point a tie.
  • Figure 5: Performance of the proposed approach relative to the baseline as a function of the training set size. The blue curve is the difference in accuracy between the proposed pre-training approach and the baseline; a positive value represents a win for the pre-training approach. Each plot shows this comparison on the datasets of the same type in the UCR archive. The $x$-axis represents the number of training examples (in $\log_{10}$ scale). The $y$-axis represents the difference in accuracy between our pre-training approach and the baseline.
  • ...and 3 more figures
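
The routing idea behind the Batch Normalizing Multiplexer in the Figure 3 caption can be illustrated with a small NumPy sketch. It assumes inputs of shape `(batch, time, channels)`, one set of scale and shift parameters per dataset, and training-mode statistics computed per sub-batch. The class name `BatchNormMultiplexer` and its parameters are hypothetical, running statistics for inference are omitted, and this is not the authors' implementation.

```python
import numpy as np

class BatchNormMultiplexer:
    """Sketch: one batch normalization per dataset, selected by a control input."""

    def __init__(self, n_datasets, n_channels, eps=1e-5):
        # One (gamma, beta) pair per dataset; learnable in a real model.
        self.gamma = np.ones((n_datasets, n_channels))
        self.beta = np.zeros((n_datasets, n_channels))
        self.eps = eps

    def __call__(self, x, dataset_ids):
        # x: (batch, time, channels); dataset_ids: (batch,) integer control input.
        out = np.empty_like(x, dtype=float)
        for d in np.unique(dataset_ids):
            mask = dataset_ids == d
            sub = x[mask]  # the part of the batch that came from dataset d
            # Per-channel statistics computed only over that part.
            mean = sub.mean(axis=(0, 1), keepdims=True)
            var = sub.var(axis=(0, 1), keepdims=True)
            normed = (sub - mean) / np.sqrt(var + self.eps)
            out[mask] = self.gamma[d] * normed + self.beta[d]
        return out

# Toy batch (hypothetical): two datasets with very different scales.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, (4, 20, 3)), rng.normal(10, 5, (6, 20, 3))])
ids = np.array([0] * 4 + [1] * 6)
bnm = BatchNormMultiplexer(n_datasets=2, n_channels=3)
y = bnm(x, ids)
```

Each per-dataset layer sees only its share of the batch, which matches the caption's remark that the batch is split across several batch normalization layers.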