USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

Penghui Niu, Taotao Cai, Jiashuai She, Yajuan Zhang, Junhua Gu, Ping Zhang, Jungong Han, Jianxin Li

TL;DR

USF-Net targets ultra-short-term extrapolation of ground-based cloud image sequences for photovoltaic (PV) forecasting with a unified spatiotemporal architecture that explicitly couples temporal flow with spatial feature learning. Its Unified Spatiotemporal Module (USTM) comprises a Spatial Information Branch (SiB) with dynamic large-kernel selection, a Temporal Information Branch (TiB) with a Temporal Agent Attention Module (TAM), and a Dynamic Spatiotemporal Module (DSM) that fuses context via a Temporal Guidance Module (TGM); in the decoder, a Dynamic Update Module (DUM) mitigates ghosting. The approach is evaluated on the newly released ASI-CIS dataset, where USF-Net is reported to achieve state-of-the-art accuracy with favorable efficiency, and ablations illustrate the contribution of each module. The work is relevant to ultra-short-term PV power forecasting and cloud monitoring, and the planned release of the ASI-CIS dataset and code is intended to support further research on high-resolution, multi-scale cloud extrapolation.

Abstract

Ground-based remote sensing cloud image sequence extrapolation is a key research area in the development of photovoltaic power systems. However, existing approaches exhibit several limitations: (1) they primarily rely on static kernels to augment feature information, lacking adaptive mechanisms to extract features at varying resolutions dynamically; (2) temporal guidance is insufficient, leading to suboptimal modeling of long-range spatiotemporal dependencies; and (3) the quadratic computational cost of attention mechanisms is often overlooked, limiting efficiency in practical deployment. To address these challenges, we propose USF-Net, a Unified Spatiotemporal Fusion Network that integrates adaptive large-kernel convolutions and a low-complexity attention mechanism, combining temporal flow information within an encoder-decoder framework. Specifically, the encoder employs three basic layers to extract features, followed by the Unified Spatiotemporal Module (USTM), which comprises: (1) a spatial information branch (SiB) equipped with an SSM that dynamically captures multi-scale contextual information, and (2) a temporal information branch (TiB) featuring a temporal agent attention module (TAM) that models long-range temporal dependencies while maintaining computational efficiency. In addition, a dynamic spatiotemporal module (DSM) with a temporal guidance module (TGM) is introduced to enable unified modeling of temporally guided spatiotemporal dependencies. On the decoder side, a dynamic update module (DUM) is employed to address the common "ghosting effect"; it uses the initial temporal state as an attention operator to preserve critical motion signatures. As a key contribution, we also introduce and release the ASI-CIS dataset. Extensive experiments on ASI-CIS demonstrate that USF-Net significantly outperforms state-of-the-art methods, establishing a superior balance between prediction accuracy and computational efficiency for ground-based cloud extrapolation. The dataset and source code will be available at https://github.com/she1110/ASI-CIS.
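
The abstract contrasts the quadratic cost of standard attention with a low-complexity alternative. As a rough illustration, the sketch below compares plain softmax attention with a generic agent-attention formulation in NumPy. It is not the paper's TAM, whose internals are not given in this section: the function names, the single-head layout, and the choice to pool agent tokens by averaging contiguous groups of queries are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def softmax_attention(q, k, v):
    """Standard attention: materializes an (N, N) score matrix, so O(N^2 d)."""
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale) @ v


def agent_attention(q, k, v, num_agents):
    """Generic agent attention sketch: two small score matrices, O(N n d).

    Assumption: agent tokens are obtained by averaging contiguous groups of
    query tokens. The paper's TAM may pool or parameterize them differently.
    """
    n_tokens, dim = q.shape
    scale = dim ** -0.5
    groups = np.array_split(np.arange(n_tokens), num_agents)
    agents = np.stack([q[idx].mean(axis=0) for idx in groups])  # (n, d)
    # Step 1: agents aggregate information from keys/values, scores are (n, N).
    agent_values = softmax(agents @ k.T * scale) @ v            # (n, d)
    # Step 2: queries read from the agents, scores are (N, n).
    return softmax(q @ agents.T * scale) @ agent_values         # (N, d)
```

With N = 1024 tokens and n = 16 agents, the two agent score matrices hold 2 × 1024 × 16 = 32,768 entries, versus 1024 × 1024 = 1,048,576 for the full score matrix, which is where the linear-in-N cost comes from.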

Paper Structure

This paper contains 23 sections, 19 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: (a) illustrates multi-scale cloud movement. The red, yellow, and blue blocks represent displacement vectors of large, medium, and small-scale clouds, respectively. The arrow indicates the direction of the movement trend. (b) demonstrates “ghosting effects” in cloud image sequence extrapolation. The orange block denotes ground truth (GT), whereas the green block indicates extrapolation results.
  • Figure 2: (a) The proposed USF-Net consists of three parts: an encoder comprising three Basic Layers, the USTM, and a decoder comprising a dynamic update module (DUM). $C_{i}$ denotes the number of channels of the feature map. (b) The structure of the encoder, where $N_{1}$, $N_{2}$, and $N_{3}$ are 2, 2, and 3, respectively. The output of the encoder is $X_{B}$. (c) The specific structure of the Basic Layer. (d) The proposed Unified SpatioTemporal Module (USTM), which comprises three core components: a spatial information branch (SiB), a temporal information branch (TiB), and a dynamic spatiotemporal module (DSM). The output of the USTM is $X_{T}$.
  • Figure 3: The structure of the proposed SiB. The SSM employs explicitly decomposed convolution operations to generate varying receptive field sizes, thereby enhancing the network's multi-scale representational capacity.
  • Figure 4: (a) The overall structure of the proposed TiB. (b) The proposed CE consists of a $3\times3$ convolution, batch normalization (BN), ReLU activation, and a depthwise (DW) convolutional layer with residual connections. (c) The proposed DSM; the dashed line denotes the feature flow of agent attention, and the solid line denotes the feature flow of softmax attention.
  • Figure 5: The structure of the proposed DSM. The bottom of the figure shows the structure of the TGM in detail. The learnable dynamic convolution kernels are generated by applying weighted guidance from temporal flow information to the spatial feature maps.
  • ...and 6 more figures
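
The Figure 3 caption notes that the SSM uses explicitly decomposed convolutions to obtain varying receptive field sizes. The sketch below works through the standard receptive-field arithmetic for a stack of stride-1 (possibly dilated) convolutions, to show why a decomposed stack can reach a large receptive field with far fewer depthwise weights than one large kernel. The kernel sizes and dilations used here (a 5×5 layer followed by a 7×7 layer with dilation 3) and the helper names are hypothetical values chosen for illustration, not the paper's configuration.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs. Each layer
    widens the receptive field by dilation * (kernel_size - 1).
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += dilation * (kernel_size - 1)
    return rf


def parameter_count(layers, channels):
    """Depthwise weights only: one k x k filter per channel per layer (no bias)."""
    return sum(channels * kernel_size * kernel_size for kernel_size, _ in layers)


# Hypothetical decomposition: 5x5 (dilation 1) followed by 7x7 (dilation 3).
decomposed = [(5, 1), (7, 3)]
single_large = [(23, 1)]

# Intermediate outputs expose different receptive fields for later selection.
per_stage_rf = [receptive_field(decomposed[: i + 1]) for i in range(len(decomposed))]
```

Under these assumed settings the two stages see receptive fields of 5 and 23, matching a single 23×23 kernel at the final stage, while for 64 channels the decomposed stack needs 64 × (25 + 49) = 4,736 depthwise weights against 64 × 529 = 33,856 for the single large kernel.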