Solar Active Region Magnetogram Image Dataset for Studies of Space Weather
Laura E. Boucheron, Ty Vincent, Jeremy A. Grajeda, Ellery Wuest
TL;DR
This work presents a reproducible magnetogram dataset for space-weather research that combines NOAA active region (AR) catalogs, SDO/HMI magnetograms, and GOES flare labels into a minimally processed collection of fixed-size images, available in preconfigured and reduced forms. The authors describe the end-to-end data preparation: AR identification, automated magnetogram download, flare labeling within configurable prediction windows, and stratified train/validation/test splits. They validate the dataset in two ways, with baseline magnetic-complexity features classified by an SVM and with transfer learning on a VGG16 CNN, and report competitive performance for both. Configurable options include filtering by latitude/longitude and NaN handling, and downsized variants support rapid experimentation and benchmarking of solar flare forecasting methods.
Abstract
This dataset provides a comprehensive collection of magnetograms (images quantifying the strength of the magnetic field) from the National Aeronautics and Space Administration's (NASA's) Solar Dynamics Observatory (SDO). It incorporates data from three sources and provides SDO Helioseismic and Magnetic Imager (HMI) magnetograms of solar active regions (regions of large magnetic flux, generally the source of eruptive events) along with labels of the corresponding flaring activity. The dataset will be useful for image analysis and solar physics research on magnetic structure, its evolution over time, and its relation to solar flares. It will be of interest to researchers investigating automated solar flare prediction methods, including supervised and unsupervised machine learning (classical and deep), binary and multi-class classification, and regression. It is a minimally processed, user-configurable dataset of consistently sized images of solar active regions that can serve as a benchmark for solar flare prediction research.
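To make the idea of flare labels tied to a prediction window concrete, the following is a minimal sketch, not the dataset's actual labeling code. It assumes that each magnetogram is labeled with the largest GOES flare class its active region produces within a configurable number of hours after the observation, that the window is measured from flare start times, and that it excludes the observation time itself. The names `goes_key` and `label_magnetogram` and the `(start_time, goes_class)` tuple layout are hypothetical.

```python
from datetime import datetime, timedelta

# GOES flare classes in order of increasing peak X-ray flux.
CLASS_ORDER = {"A": 0, "B": 1, "C": 2, "M": 3, "X": 4}


def goes_key(flare_class):
    """Sort key for a GOES class string such as 'M1.5': letter first, then magnitude."""
    return (CLASS_ORDER[flare_class[0]], float(flare_class[1:]))


def label_magnetogram(obs_time, flares, window_hours=24):
    """Return the largest flare class starting within `window_hours` after
    `obs_time`, or None if the region did not flare in that window.

    `flares` is an iterable of (start_time, goes_class) tuples for a single
    active region (hypothetical layout, assumed for this sketch).
    """
    window_end = obs_time + timedelta(hours=window_hours)
    in_window = [cls for start, cls in flares if obs_time < start <= window_end]
    return max(in_window, key=goes_key) if in_window else None


# Hypothetical flare list for one active region.
flares = [
    (datetime(2014, 1, 1, 6), "C3.2"),
    (datetime(2014, 1, 1, 20), "M1.5"),
    (datetime(2014, 1, 3, 0), "X1.0"),
]
label_24h = label_magnetogram(datetime(2014, 1, 1, 0), flares)
label_12h = label_magnetogram(datetime(2014, 1, 1, 0), flares, window_hours=12)
```

A label of this form can serve multi-class classification directly, can be thresholded (for example, at class M) for binary classification, or can be mapped to a flux value for regression.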
