Class-Based Time Series Data Augmentation to Mitigate Extreme Class Imbalance for Solar Flare Prediction
Junzhi Wen, Rafal A. Angryk
TL;DR
This paper tackles extreme class imbalance in multivariate time series (MVTS) for solar flare prediction. It introduces Mean Gaussian Noise (MGN), a class-based augmentation that synthesizes samples by perturbing the per-time-step means of the underrepresented class: $T' = \{\bar{t}_1 \cdot (1 + \epsilon_1), \ldots, \bar{t}_n \cdot (1 + \epsilon_n)\}$, where $\bar{t}_i$ is the class mean at time step $i$ and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. In experiments on the SWAN-SF dataset using TimeSeriesSVC, MGN is compared to eight basic augmentation methods and shows superior or competitive performance across multiple partitions while maintaining a reasonable run-time. The results suggest that global, class-based augmentation can effectively expand coverage of the minority class in high-dimensional MVTS spaces, which benefits extremely imbalanced time series classification tasks and motivates future extensions to broader datasets and classifiers. Overall, MGN is a simple augmentation tool that improves discrimination and reliability (as reflected in $TSS$ and $HSS2$) for solar flare prediction and potentially for other rare-event MVTS problems.
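The MGN formula above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `mean_gaussian_noise`, the array layout `(n_samples, n_timesteps, n_features)`, the default `sigma`, and the choice to draw independent noise per time step and per feature are all assumptions made for illustration.

```python
import numpy as np


def mean_gaussian_noise(minority, n_new, sigma=0.05, seed=None):
    """Sketch of MGN: perturb the minority-class mean series with
    multiplicative Gaussian noise, T' = mean * (1 + eps).

    minority: array of shape (n_samples, n_timesteps, n_features)
              (assumed layout, not specified in the source).
    n_new:    number of synthetic samples to generate.
    sigma:    standard deviation of eps ~ N(0, sigma^2) (assumed default).
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Per-time-step (and, by assumption, per-feature) mean of the class.
    class_mean = minority.mean(axis=0)  # shape: (n_timesteps, n_features)

    # Independent noise term for every synthetic sample, step, and feature.
    eps = rng.normal(0.0, sigma, size=(n_new,) + class_mean.shape)

    # Broadcasting applies the class mean to each synthetic sample.
    return class_mean * (1.0 + eps)
```

Because every synthetic sample is drawn around the class-level mean rather than around an individual series, the method is global to the class, which is what distinguishes it from sample-wise augmentations.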
Abstract
Time series data plays a crucial role across various domains, making it valuable for decision-making and predictive modeling. Machine learning (ML) and deep learning (DL) have shown promise in this regard, yet their performance hinges on data quality and quantity and is often constrained by data scarcity and class imbalance, particularly for rare events like solar flares. Data augmentation techniques offer a potential solution to these challenges, yet their effectiveness on multivariate time series datasets remains underexplored. In this study, we propose a novel data augmentation method for time series data named Mean Gaussian Noise (MGN). We compare MGN with eight existing basic data augmentation methods on SWAN-SF, a multivariate time series dataset for solar flare prediction, using TimeSeriesSVC, an ML algorithm for time series data. The results demonstrate the efficacy of MGN and highlight its potential for improving classification performance in scenarios with extremely imbalanced data. Our time complexity analysis shows that MGN also has a competitive computational cost compared to the investigated alternative methods.
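Classification performance on extremely imbalanced data is summarized here by $TSS$ and $HSS2$. As a hedged sketch, the functions below use the definitions of these scores that are standard in flare forecasting, computed from confusion-matrix counts; the function names and argument order are illustrative assumptions, and the definitions are assumed to match the ones used in the paper's evaluation.

```python
def tss(tp, fn, fp, tn):
    """True Skill Statistic: recall minus false alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def hss2(tp, fn, fp, tn):
    """Heidke Skill Score (HSS2): improvement over random forecasts."""
    numerator = 2 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator
```

Both scores equal 1 for a perfect classifier and 0 for one that always predicts the majority class, which is why they are preferred over accuracy when the positive class is rare.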
