Data Quality Issues in Flare Prediction using Machine Learning Models
Ke Hu, Kevin Jin, Victor Verma, Weihao Liu, Ward Manchester, Lulu Zhao, Tamas Gombosi, Yang Chen
TL;DR
This study investigates how data quality and cross-source inconsistencies affect ML-based solar flare forecasting. It compares science-quality, operational SWPC-FTP, and SunPy-HEK flare lists as sources of flare labels, and analyzes imaging (HMI/AIA) and SHARP vector predictors, including augmentation of science-quality data with AR labels. Using two representative models (LSTM and logistic regression) across multiple forecast horizons and solar-cycle phases, it shows that data-product choices yield solar-cycle-dependent performance differences and that the choice between near-real-time and definitive predictor data influences stability and accuracy. The paper provides a reproducible data-processing pipeline, practical recommendations for data selection, and insights to improve interpretability and comparability in data-driven flare forecasting.
Abstract
Machine learning models for forecasting solar flares have been trained and tested using a variety of data sources, such as Space Weather Prediction Center (SWPC) operational and science-quality data. Typically, data from these sources is minimally processed before being used to train and validate a forecasting model. However, predictive performance can be impaired if defects in, and inconsistencies between, these data sources are ignored. For a number of commonly used data sources, together with the software tools that query them and output processed data, we identify their respective defects and inconsistencies, quantify their extent, and show how they can affect the predictions produced by data-driven machine learning forecasting models. We also outline procedures for fixing these issues or at least mitigating their impacts. Finally, based on thorough comparisons of how the choice of data source affects the predictive skill scores of the trained forecasting model, we offer recommendations for the use of different data products in operational forecasting.
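To make the idea of label inconsistency affecting skill scores concrete, the following minimal sketch scores one set of predictions against flare labels from two sources. It is an illustration under stated assumptions, not the paper's pipeline: the True Skill Statistic (TSS) is assumed here as the skill score because it is widely used in flare forecasting, and the label lists, predictions, and function names are all hypothetical.

```python
def true_skill_statistic(y_true, y_pred):
    """TSS = recall - false alarm rate, for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS is undefined without both classes present")
    return tp / (tp + fn) - fp / (fp + tn)


def label_disagreement(labels_a, labels_b):
    """Fraction of samples on which two label sources differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Hypothetical flare / no-flare labels for the same eight samples,
# derived from two different flare lists (values invented for illustration).
labels_science = [1, 0, 0, 1, 0, 1, 0, 0]
labels_operational = [1, 0, 0, 0, 0, 1, 1, 0]

# Hypothetical model predictions for the same samples.
predictions = [1, 0, 1, 1, 0, 1, 0, 0]

disagreement = label_disagreement(labels_science, labels_operational)
tss_science = true_skill_statistic(labels_science, predictions)
tss_operational = true_skill_statistic(labels_operational, predictions)

print(f"label disagreement: {disagreement:.2f}")
print(f"TSS vs science-quality labels: {tss_science:.3f}")
print(f"TSS vs operational labels:     {tss_operational:.3f}")
```

In this toy example the two label sources differ on only two of eight samples, yet the same predictions receive noticeably different TSS values, which is why the label source needs to be reported alongside any skill score.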
