WaveSFNet: A Wavelet-Based Codec and Spatial-Frequency Dual-Domain Gating Network for Spatiotemporal Prediction

Xinyong Cai, Runming Xie, Hu Chen, Yuankai Wu

Abstract

Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for downsampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial-frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. The translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at https://github.com/fhjdqaq/WaveSFNet.
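To illustrate why a wavelet transform can halve spatial resolution without discarding detail, the following is a minimal NumPy sketch of a one-level 2D Haar transform and its inverse. It is an assumption-laden illustration, not the WaveSFNet codec: the abstract does not say which wavelet the codec uses, so Haar is chosen here only because it is the simplest case, and the names `haar_dwt2` and `haar_idwt2` are hypothetical. Subband naming (LH versus HL) also varies between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform of an (H, W) array with even H and W.

    Returns four (H/2, W/2) subbands: one coarse band (ll) and three
    detail bands (lh, hl, hh) that carry the high-frequency content.
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (coarse structure)
    lh = (a - b + c - d) / 2.0  # difference across columns
    hl = (a + b - c - d) / 2.0  # difference across rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (2H, 2W) array from four subbands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))

subbands = haar_dwt2(frame)
reconstructed = haar_idwt2(*subbands)

# Strided subsampling keeps one pixel in four; the other three are gone.
strided = frame[::2, ::2]
```

Here the four subbands together hold exactly as many values as the input frame, so `reconstructed` matches `frame` up to floating-point error, whereas `strided` retains only a quarter of the samples. Stacking the subbands along the channel axis is one common way to feed such a decomposition to a convolutional encoder; whether WaveSFNet does this is not stated in the abstract.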

Paper Structure (20 sections, 8 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Performance comparison on the TaxiBJ dataset. Bubble size denotes FLOPs. WaveSFNet achieves the lowest MSE with reduced complexity.
  • Figure 2: Overall architecture and core modules of WaveSFNet. WaveSFNet follows an encoder-translator-decoder design. A wavelet-based multi-scale encoder extracts latent features from input frames. A TDI Block injects adjacent-frame differences, and $N_t$ stacked ST Blocks apply spatial-frequency dual-domain gating after packing time into channels. A wavelet-symmetric decoder reconstructs predictions.
  • Figure 3: Qualitative visualizations of WaveSFNet on TaxiBJ.
  • Figure 4: Frequency spectrum analysis on the WeatherBench T2M dataset. The plot displays the Radially Averaged Power Spectral Density, where the x-axis represents the radial spatial frequency and the y-axis denotes the log power spectrum. The inset provides a zoomed-in view of the high-frequency region.
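The radially averaged power spectral density named in the Figure 4 caption can be sketched as follows. This is a generic NumPy illustration under stated assumptions, not the authors' analysis script: the function name `radial_psd`, the integer-ring binning, and the 32x64 test grid are all choices made here for the example.

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectral density of a 2D array.

    Returns (radii, psd), where psd[k] is the mean spectral power over
    all frequency bins whose distance from the zero-frequency bin,
    truncated to an integer, equals k.
    """
    h, w = field.shape
    # Shift the zero-frequency bin to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(field))
    power = np.abs(spectrum) ** 2
    # Integer ring index of every bin, measured from the spectrum centre.
    yy, xx = np.indices((h, w))
    ring = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Keep only rings that fit entirely inside the shorter axis.
    max_r = min(h, w) // 2
    ring_sum = np.bincount(ring.ravel(), weights=power.ravel())[:max_r]
    ring_count = np.bincount(ring.ravel())[:max_r]
    return np.arange(max_r), ring_sum / ring_count

# Synthetic field: a single cosine with 5 cycles across the width.
h, w = 32, 64
x = np.arange(w)
field = np.tile(np.cos(2 * np.pi * 5 * x / w), (h, 1))

radii, psd = radial_psd(field)
peak_radius = int(np.argmax(psd))

# For a plot like the one described, the y-axis would be the log power;
# the small constant only guards against log(0) in empty rings.
log_psd = np.log10(psd + 1e-12)
```

For this synthetic field, all spectral power sits at horizontal frequency 5, so `peak_radius` is 5. Applied to predicted and ground-truth frames, a faster fall-off of the predicted curve at large radii would indicate loss of fine detail, which is the kind of comparison the caption's zoomed-in high-frequency inset suggests.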