USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation
Penghui Niu, Taotao Cai, Jiashuai She, Yajuan Zhang, Junhua Gu, Ping Zhang, Jungong Han, Jianxin Li
TL;DR
USF-Net addresses ultra-short-term ground-based cloud sequence extrapolation for photovoltaic forecasting with a unified spatiotemporal architecture that explicitly couples temporal flow with spatial feature learning. It combines a Unified Spatiotemporal Module (USTM), which pairs a Spatial Information Branch using dynamic large-kernel selection with a Temporal Information Branch built on a Temporal Agent Attention Module; a Dynamic Spatiotemporal Module that fuses context via a Temporal Guidance Module; and a Dynamic Update Module in the decoder that mitigates ghosting. The approach is validated on the newly released ASI-CIS dataset, where USF-Net achieves state-of-the-art accuracy with favorable efficiency, supported by comprehensive ablations illustrating the contribution of each module. The work has practical impact for ultra-short-term PV power forecasting and cloud monitoring, and the ASI-CIS dataset and code release will facilitate further research in high-resolution, multi-scale cloud extrapolation.
Abstract
Ground-based remote sensing cloud image sequence extrapolation is a key research area in the development of photovoltaic power systems. However, existing approaches exhibit several limitations: (1) they primarily rely on static kernels to augment feature information, lacking adaptive mechanisms to extract features at varying resolutions dynamically; (2) temporal guidance is insufficient, leading to suboptimal modeling of long-range spatiotemporal dependencies; and (3) the quadratic computational cost of attention mechanisms is often overlooked, limiting efficiency in practical deployment. To address these challenges, we propose USF-Net, a Unified Spatiotemporal Fusion Network that integrates adaptive large-kernel convolutions and a low-complexity attention mechanism, combining temporal flow information within an encoder-decoder framework. Specifically, the encoder employs three basic layers to extract features, followed by the Unified Spatiotemporal Module (USTM), which comprises: (1) a Spatial Information Branch (SiB) equipped with an SSM that dynamically captures multi-scale contextual information, and (2) a Temporal Information Branch (TiB) featuring a Temporal Agent Attention Module (TAM) that effectively models long-range temporal dependencies while maintaining computational efficiency. In addition, a Dynamic Spatiotemporal Module (DSM) with a Temporal Guidance Module (TGM) is introduced to enable unified modeling of temporally guided spatiotemporal dependencies. On the decoder side, a Dynamic Update Module (DUM) is employed to address the common "ghosting effect"; it uses the initial temporal state as an attention operator to preserve critical motion signatures. As a key contribution, we also introduce and release the ASI-CIS dataset. Extensive experiments on ASI-CIS demonstrate that USF-Net significantly outperforms state-of-the-art methods, establishing a superior balance between prediction accuracy and computational efficiency for ground-based cloud extrapolation. The dataset and source code will be available at https://github.com/she1110/ASI-CIS.
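To make the contrast between quadratic attention and a low-complexity alternative concrete, the following is a minimal sketch of generic agent-style attention in plain Python. It is not the TAM as implemented in USF-Net: the function names, the mean-pooling used to form agent tokens, and the toy sizes are all assumptions chosen for illustration. The idea is that a small set of agent tokens first summarizes the keys and values, and the queries then attend only to those agents, so the number of attention scores grows linearly with the number of tokens.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), using nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attend(q, k, v):
    # Standard scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def pool_agents(q, num_agents):
    # Hypothetical choice: build agent tokens by mean-pooling contiguous
    # groups of queries. Assumes len(q) is divisible by num_agents.
    size = len(q) // num_agents
    agents = []
    for a in range(num_agents):
        chunk = q[a * size:(a + 1) * size]
        agents.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return agents

def agent_attention(q, k, v, num_agents):
    agents = pool_agents(q, num_agents)
    # Stage 1: each agent aggregates information from all keys/values.
    agent_values = attend(agents, k, v)
    # Stage 2: each query reads from the small set of agents.
    return attend(q, agents, agent_values)

def score_count(num_tokens, num_agents):
    # Number of attention scores computed: full attention vs. two-stage
    # agent attention (agents-to-keys plus queries-to-agents).
    return num_tokens * num_tokens, 2 * num_tokens * num_agents
```

Under these assumptions, `score_count(1024, 16)` gives 1,048,576 scores for full attention against 32,768 for the agent variant, a 32-fold reduction for the same sequence length.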
