Data Quality Issues in Flare Prediction using Machine Learning Models

Ke Hu, Kevin Jin, Victor Verma, Weihao Liu, Ward Manchester, Lulu Zhao, Tamas Gombosi, Yang Chen

TL;DR

This study investigates how data quality and cross-source inconsistencies affect ML-based solar flare forecasting. It compares the science-quality, operational SWPC-FTP, and SunPy-HEK flare lists as sources of flare labels and analyzes imaging (HMI/AIA) and SHARP vector predictors, including augmentation of science-quality data with AR labels. Using two representative models (LSTM and logistic regression) across multiple forecast horizons and solar-cycle phases, it shows that data-product choices yield solar-cycle-dependent performance differences and that the choice between near-real-time and definitive predictor data influences stability and accuracy. The paper provides a reproducible data-processing pipeline, practical recommendations for data selection, and insights to improve interpretability and comparability in data-driven flare forecasting.

Abstract

Machine learning models for forecasting solar flares have been trained and tested using a variety of data sources, such as Space Weather Prediction Center (SWPC) operational and science-quality data. Typically, data from these sources are minimally processed before being used to train and validate a forecasting model. However, predictive performance can be impaired if defects in these data sources, and inconsistencies between them, are ignored. For a number of commonly used data sources, together with the software tools that query them and output processed data, we identify their respective defects and inconsistencies, quantify their extent, and show how they can affect the predictions produced by data-driven machine learning forecasting models. We also outline procedures for fixing these issues or at least mitigating their impacts. Finally, based on thorough comparisons of how the choice of data source affects the predictive skill scores of the trained forecasting model, we offer recommendations for the use of different data products in operational forecasting.

Paper Structure

This paper contains 24 sections, 1 equation, 20 figures, 15 tables.

Figures (20)

  • Figure 1: A schematic illustrating how a machine learning method produces predictions from predictors. Predictors are commonly computed from either images of active regions or summary statistical parameters of those regions. The upper left part of the schematic depicts predictors computed from HMI and AIA images, which can be represented as tensors with dimensions $h \times w \times c$, where $h$, $w$, and $c$ are the height, width, and number of channels, respectively. For both kinds of images, $h = w = 4096$, with $c = 1$ for HMI images and $c = 10$ for AIA images. The lower left part depicts predictors computed from SHARP parameters, which are summary statistics calculated from HMI images. If the method performs classification, the outcome is the indicator of whether a flare will occur; if the method performs regression, the outcome is the future peak soft X-ray flux. The upper right and lower right show comparisons of actual values to mock predictions for classification and regression, respectively. In the upper right, the green curve represents mock flare probabilities; a classifier may output predicted classes or predicted class probabilities. The images and plots display data from around the time of an X8.7-class flare that occurred on 14 May 2024 in NOAA active region 13664 (HARP 11149).
  • Figure 2: The 1-minute averaged X-ray flux on a log10 scale and the flare events on 3 May 2022. The short vertical lines mark the event times of flares. Not every local maximum in flux is labeled as a flare event. Both the flux and the flare events are science-quality data processed by NCEI.
  • Figure 3: Cumulative flare intensity since 2010/01/01 for the SunPy-HEK and SWPC-FTP flare lists. The slope of the SunPy-HEK curve for flares without a valid AR number is suspiciously steep after 2022 because more than 3000 flares are mislabeled with AR number 0. The relatively mild difference before 2020 arises because the differing records in the two lists cancel each other out.
  • Figure 4: The ratio of flare peak flux in the SWPC-FTP list to that in the NCEI science-quality list over 2020/01/01 - 2024/7/21. Before December 2019, SWPC applied a rescaling factor, so the ratio is centered at 0.7; the ratio returned to being centered at 1 once GOES-16 became the primary operational satellite in December 2019. The colored rectangle shows the data-availability time range for GOES 13-18, which differs from the period each satellite served as the primary source of operational data.
  • Figure 5: Number of flare events in the NCEI science-quality data and the SWPC-FTP data. The dashed line represents SWPC-FTP flare records with a nonzero AR number. Prior to December 2019, fewer flares were recorded in the operational data because of the SWPC scaling factor. After the transition to primarily using GOES-16, the number of science-quality flare events is smaller because of the correction.
  • ...and 15 more figures
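
The scaling effect described in the captions of Figures 4 and 5 can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the `before_transition` flag, and the use of a single constant ratio of 0.7 are assumptions made for illustration, with 0.7 taken from the Figure 4 caption as the center of the operational-to-science-quality ratio before December 2019. The class thresholds are the standard GOES soft X-ray definitions (each letter spans one decade of peak flux in W/m^2).

```python
# Assumption: before GOES-16 became primary, operational peak flux is taken
# to be about 0.7 times the science-quality value (Figure 4 caption).
PRE_GOES16_RATIO = 0.7

# Lower bound of each GOES flare class in W/m^2, strongest class first.
CLASS_FLOORS = [("X", 1e-4), ("M", 1e-5), ("C", 1e-6), ("B", 1e-7), ("A", 1e-8)]


def to_science_scale(operational_flux, before_transition):
    """Undo the assumed operational rescaling for pre-transition records."""
    if before_transition:
        return operational_flux / PRE_GOES16_RATIO
    return operational_flux


def flare_class(peak_flux):
    """Return a GOES class label such as 'M1.1' for a peak flux in W/m^2."""
    for letter, floor in CLASS_FLOORS:
        if peak_flux >= floor:
            return f"{letter}{peak_flux / floor:.1f}"
    # Anything weaker than A1.0 is still reported on the A scale.
    return f"A{peak_flux / 1e-8:.1f}"


# A hypothetical pre-transition operational record just below the M threshold.
operational_peak = 8.0e-6
print(flare_class(operational_peak))                          # C8.0
print(flare_class(to_science_scale(operational_peak, True)))  # M1.1
```

Under these assumptions, a flare recorded as C8.0 in the pre-transition operational list becomes M1.1 on the science-quality scale, so a binary label with an M-class threshold flips. That is one way the choice of flare list could change the event counts and labels a model is trained on.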