State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend
Fei Cui, Jiaojiao Fang, Xiaojiang Wu, Zelong Lai, Mengke Yang, Menghan Jia, Guizhong Liu
TL;DR
The paper tackles stochastic video prediction under non-stationary dynamics by proposing a state-space decomposition that separately models stochastic motion and deterministic appearance. A global long-term motion trend $z_1$, inferred from the full conditional sequence via a temporal transformer, guides local dynamics in the motion branch, while appearance evolves deterministically through a ViT-based encoder with a learnable token. The model employs a Gaussian prior on the initial motion $y_1$ and uses variational inference to learn posteriors for latent variables, achieving an ELBO-based objective that encourages accurate frame reconstruction and faithful latent dynamics. Empirically, the approach attains state-of-the-art or competitive results across several datasets (e.g., SMMNIST, BAIR, KTH, Human3.6M, Cityscapes, KITTI), with enhanced long-horizon coherence and clear disentanglement between motion and appearance, validating the effectiveness of incorporating global dynamics for dynamic scenes.
Abstract
Stochastic video prediction enables the consideration of uncertainty in future motion, thereby providing a better reflection of the dynamic nature of the environment. Stochastic video prediction methods based on image auto-regressive recurrent models need to feed their predictions back into the latent space. Conversely, the state-space models, which decouple frame synthesis and temporal prediction, proves to be more efficient. However, inferring long-term temporal information about motion and generalizing to dynamic scenarios under non-stationary assumptions remains an unresolved challenge. In this paper, we propose a state-space decomposition stochastic video prediction model that decomposes the overall video frame generation into deterministic appearance prediction and stochastic motion prediction. Through adaptive decomposition, the model's generalization capability to dynamic scenarios is enhanced. In the context of motion prediction, obtaining a prior on the long-term trend of future motion is crucial. Thus, in the stochastic motion prediction branch, we infer the long-term motion trend from conditional frames to guide the generation of future frames that exhibit high consistency with the conditional frames. Experimental results demonstrate that our model outperforms baselines on multiple datasets.
