Class-Based Time Series Data Augmentation to Mitigate Extreme Class Imbalance for Solar Flare Prediction

Junzhi Wen, Rafal A. Angryk

TL;DR

This paper tackles extreme class imbalance in multivariate time series (MVTS) for solar flare prediction. It introduces Mean Gaussian Noise (MGN), a class-based augmentation that synthesizes samples by perturbing the per-time-step means of the underrepresented class: $T' = \{\bar{t}_1 \cdot (1 + \epsilon_1), \ldots, \bar{t}_n \cdot (1 + \epsilon_n)\}$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, where $\bar{t}_i$ is the class mean at time step $i$. In experiments on the SWAN-SF dataset using TimeSeriesSVC, MGN is compared with eight basic augmentation methods and shows superior or competitive performance across multiple partitions while keeping run time reasonable. The results suggest that global, class-based augmentation can expand coverage of the minority class in high-dimensional MVTS spaces, which benefits extremely imbalanced time-series classification and motivates extensions to broader datasets and classifiers. Overall, MGN is a simple augmentation tool that improves discrimination and reliability (as reflected in TSS and HSS2) for solar flare prediction and potentially for other rare-event MVTS problems.
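The MGN formula above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mean_gaussian_noise`, the default `sigma=0.05`, the nested-list data layout, and the choice to draw an independent $\epsilon$ for every feature at every time step are all assumptions made here for the example.

```python
import random
from statistics import fmean


def mean_gaussian_noise(minority, n_new, sigma=0.05, seed=None):
    """Generate n_new synthetic series from the minority-class mean series.

    minority: list of samples, each a list of time steps, each a list of
              feature values (all samples must share the same shape).
    sigma:    standard deviation of the multiplicative noise (illustrative
              default, not a value taken from the paper).
    """
    rng = random.Random(seed)
    n_steps = len(minority[0])
    n_feats = len(minority[0][0])

    # Per-time-step, per-feature mean over all minority samples (t-bar_i).
    mean_series = [
        [fmean(sample[t][f] for sample in minority) for f in range(n_feats)]
        for t in range(n_steps)
    ]

    # Each synthetic value is t-bar_i * (1 + eps_i), with eps_i ~ N(0, sigma^2).
    # Assumption: eps is drawn independently for every feature and time step.
    return [
        [[m * (1.0 + rng.gauss(0.0, sigma)) for m in row] for row in mean_series]
        for _ in range(n_new)
    ]


# Toy minority class: 2 samples, 3 time steps, 2 features.
flaring = [
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
    [[3.0, 30.0], [4.0, 40.0], [5.0, 50.0]],
]
synthetic = mean_gaussian_noise(flaring, n_new=4, sigma=0.05, seed=0)
print(len(synthetic), len(synthetic[0]), len(synthetic[0][0]))
```

Because the noise is multiplicative, the perturbation scales with the magnitude of the class mean, so time steps whose mean is close to zero are barely perturbed. Plain lists keep the sketch dependency-free; an array library would express the same computation more compactly.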

Abstract

Time series data plays a crucial role across various domains, making it valuable for decision-making and predictive modeling. Machine learning (ML) and deep learning (DL) have shown promise in this regard, yet their performance hinges on data quality and quantity, often constrained by data scarcity and class imbalance, particularly for rare events like solar flares. Data augmentation techniques offer a potential solution to address these challenges, yet their effectiveness on multivariate time series datasets remains underexplored. In this study, we propose a novel data augmentation method for time series data named Mean Gaussian Noise (MGN). We investigate the performance of MGN compared to eight existing basic data augmentation methods on a multivariate time series dataset for solar flare prediction, SWAN-SF, using an ML algorithm for time series data, TimeSeriesSVC. The results demonstrate the efficacy of MGN and highlight its potential for improving classification performance in scenarios with extremely imbalanced data. Our time complexity analysis shows that MGN also has a competitive computational cost compared to the investigated alternative methods.

Paper Structure

This paper contains 18 sections, 10 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Examples of eight different basic data augmentation methods on a single MVTS data instance from the flaring class of SWAN-SF with five common features recommended in [bobra2015solar].
  • Figure 2: The difference in data generation between sample-based methods and class-based methods demonstrated using randomly generated 2D data. The synthetic data in (a) is generated by jittering and the synthetic data in (b) is generated by MGN.
  • Figure 3: Examples of Mean Gaussian Noise (MGN) applied to the five common features [bobra2015solar] of the flaring data (i.e., the extremely rare class) in Partition 1 of SWAN-SF. The blue lines represent the mean time series and the dotted red lines are the time series generated by MGN. The y-axis values for each feature are normalized to allow comparison with other methods (e.g., Fig. 1).
  • Figure 4: Comparison of the different data augmentation methods. Each dot represents the mean TSS and HSS2 over the ten runs of random undersampling for the corresponding data augmentation method. Both axes are zoomed in to show detail.
  • Figure 5: Run time of different data augmentation methods. (a) shows the computational cost of all nine different augmentation methods evaluated in this study. (b) provides a close-up view of the five overlapping methods from (a) for enhanced clarity.
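The TSS and HSS2 scores plotted in Figure 4 are not defined in this summary. The sketch below uses the standard confusion-matrix definitions of the True Skill Statistic and the Heidke Skill Score (second form) common in flare forecasting; that the paper uses exactly these forms is an assumption, and the function names and example counts are illustrative only.

```python
def tss(tp, fn, fp, tn):
    """True Skill Statistic: recall minus false-alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def hss2(tp, fn, fp, tn):
    """Heidke Skill Score (HSS2): improvement over a random forecast."""
    numerator = 2.0 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator


# Hypothetical counts for an imbalanced test set (50 flaring, 1000 non-flaring).
counts = dict(tp=40, fn=10, fp=100, tn=900)
print(f"TSS  = {tss(**counts):.3f}")
print(f"HSS2 = {hss2(**counts):.3f}")
```

TSS depends only on the per-class rates, so it is insensitive to the class-imbalance ratio, whereas HSS2 is penalized by the many false positives that a large negative class can produce. Reporting both, as Figure 4 does, gives complementary views of performance.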