Solar flare forecasting with foundational transformer models across image, video, and time-series modalities
S. Riggi, P. Romano, A. Pilzer, U. Becciani
TL;DR
This paper benchmarks three transformer-based backbones (SigLIP2 for images, VideoMAE for videos, and Moirai2 for time series) for solar flare forecasting, using SDO/HMI magnetograms and GOES soft X-ray flux data. With consistent train/validation/test splits and several loss strategies, Moirai2 achieves the highest forecasting skill (TSS ≈ 0.74), which underscores the predictive value of the temporal evolution of coronal emission. The image and video transformers capture spatial structure and short-term evolution but reach lower skill (TSS ≈ 0.60–0.65), which motivates unified multimodal models that combine magnetic topology with coronal activity. The work emphasizes reproducibility and the public release of code and weights, and it outlines future directions toward physics-informed, multimodal forecasting of solar eruptions.
Abstract
We present a comparative study of transformer-based architectures for solar flare forecasting using heterogeneous data modalities, including images, video sequences, and time-series observations. Our analysis evaluates three recent foundational models (SigLIP2 for image encoding, VideoMAE for spatio-temporal video representation, and Moirai2 for multivariate time-series forecasting) applied to publicly available datasets of solar magnetograms from the SDO/HMI mission and soft X-ray fluxes acquired by GOES satellites. All models are trained and validated under consistent data splits and evaluation criteria, with the goal of assessing the strengths and limitations of transformer backbones across spatial and temporal representations of solar activity. We investigate multiple loss formulations (weighted BCE, focal, and score-oriented) and class-balancing strategies during training to mitigate the class imbalance typical of flare datasets. Results show that while both SigLIP2 and VideoMAE achieve skill typical for image and video data (True Skill Statistic, TSS ≈ 0.60–0.65), the time-series model Moirai2 reaches superior forecasting skill (TSS ≈ 0.74) using irradiance-based temporal evolution alone. These findings highlight the potential of pretrained transformer architectures and cross-modal learning for advancing operational space weather forecasting, paving the way toward unified multimodal models that integrate visual and temporal information.
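To make the evaluation metric and two of the loss formulations named above concrete, the following is a minimal Python sketch, not the authors' implementation. It uses the standard definition of the True Skill Statistic, TSS = TP/(TP+FN) - FP/(FP+TN), the usual positive-class-weighted binary cross-entropy, and the focal loss in its common form, -alpha_t (1 - p_t)^gamma log(p_t). The function names, the per-sample formulation, and the default values of `pos_weight`, `alpha`, and `gamma` are illustrative assumptions; the abstract does not state the hyperparameters used, and the score-oriented loss is not sketched here.

```python
import math


def true_skill_statistic(y_true, y_pred):
    """TSS = TP/(TP+FN) - FP/(FP+TN) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS needs at least one sample of each class")
    return tp / (tp + fn) - fp / (fp + tn)


def weighted_bce(y, p, pos_weight=1.0, eps=1e-7):
    """Per-sample binary cross-entropy with extra weight on the positive class.

    pos_weight is an assumed hyperparameter (e.g. the negative/positive ratio).
    """
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-sample focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    alpha and gamma are assumed defaults, not values taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy example: 4 flaring and 6 non-flaring samples (made-up labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tss = true_skill_statistic(y_true, y_pred)  # 3/4 - 1/6
```

Because TSS is the difference between the true positive rate and the false positive rate, it does not depend on the ratio of flaring to non-flaring samples, which is one reason it is widely used for imbalanced flare datasets. With `gamma=0`, the focal loss reduces to an alpha-weighted cross-entropy, so the two losses can be compared directly.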
