MSTAR: Multi-Scale Backbone Architecture Search for Timeseries Classification
Tue M. Cao, Nhat H. Tran, Hieu H. Pham, Hung T. Nguyen, Le P. Nguyen
TL;DR
Time Series Classification often hinges on capturing informative patterns across multiple time scales while preserving temporal localization. The authors introduce MSTAR, a multi-scale backbone search space and NAS framework that jointly optimizes receptive fields and time resolution using a cell-based, InceptionTime-inspired design encoded as a $4 \times 13 \times 13$ adjacency tensor. A convolutional autoencoder (CAE) and neural predictors guide Bayesian optimization to efficiently explore architectures, while a static encoder decodes candidates for evaluation, enabling scalable discovery. Across PTB-XL, EEGEyeNet, Smartphone HAR, and Satellite datasets, MSTAR achieves state-of-the-art performance and demonstrates strong compatibility with Vision Transformer backbones, highlighting the practical impact of time-resolution-aware architecture search for diverse time-series tasks.
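To make the $4 \times 13 \times 13$ encoding concrete, here is a minimal sketch of how a cell could be stored as a binary adjacency tensor. The summary above gives only the shape, so the axis meanings are assumptions: the first axis is taken to index one of four edge (operation) types, and the other two to index source and destination nodes among 13 cell nodes. The names `encode_cell` and `decode_cell` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

NUM_OPS = 4     # assumption: one channel per candidate edge/operation type
NUM_NODES = 13  # assumption: 13 nodes per cell, matching the 4 x 13 x 13 shape


def encode_cell(edges):
    """Encode (op, src, dst) edges as a binary 4 x 13 x 13 adjacency tensor.

    Edges must satisfy src < dst so the cell stays a directed acyclic graph.
    """
    tensor = np.zeros((NUM_OPS, NUM_NODES, NUM_NODES), dtype=np.uint8)
    for op, src, dst in edges:
        if not 0 <= op < NUM_OPS:
            raise ValueError(f"operation index out of range: {op}")
        if not 0 <= src < dst < NUM_NODES:
            raise ValueError(f"edge must satisfy 0 <= src < dst < {NUM_NODES}")
        tensor[op, src, dst] = 1
    return tensor


def decode_cell(tensor):
    """Recover the sorted list of (op, src, dst) edges from the tensor."""
    ops, srcs, dsts = np.nonzero(tensor)
    return sorted((int(o), int(s), int(d)) for o, s, d in zip(ops, srcs, dsts))


# Illustrative cell with three edges (made-up values, not from the paper).
edges = [(0, 0, 1), (2, 1, 5), (3, 5, 12)]
cell = encode_cell(edges)
```

A fixed-size tensor of this kind is a convenient input for a convolutional autoencoder, which is one plausible reason for choosing such an encoding; the actual semantics of each axis should be checked against the paper.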
Abstract
Most previous approaches to Time Series Classification (TSC) emphasize receptive fields and frequencies while overlooking time resolution. As a result, they inevitably suffer from scalability issues, because they integrate an extensive range of receptive fields into the classification model. Other methods adapt better to large datasets but require manual design and still cannot reach the optimal architecture, owing to the uniqueness of each dataset. We overcome these challenges by proposing a novel multi-scale search space and a framework for Neural Architecture Search (NAS) that addresses both frequency and time resolution and discovers the suitable scale for a specific dataset. We further show that our model can serve as a backbone for a powerful Transformer module with either untrained or pre-trained weights. Our search space reaches state-of-the-art performance on four datasets from four different domains while introducing more than ten highly fine-tuned models for each dataset.
