State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend

Fei Cui, Jiaojiao Fang, Xiaojiang Wu, Zelong Lai, Mengke Yang, Menghan Jia, Guizhong Liu

TL;DR

The paper tackles stochastic video prediction under non-stationary dynamics by proposing a state-space decomposition that models stochastic motion and deterministic appearance separately. A global long-term motion trend $z_1$, inferred from the full conditional sequence via a temporal transformer, guides local dynamics in the motion branch, while appearance evolves deterministically through a ViT-based encoder with a learnable token. The model places a Gaussian prior on the initial motion $y_1$ and uses variational inference to learn posteriors for the latent variables, optimizing an ELBO-based objective that encourages accurate frame reconstruction and faithful latent dynamics. Empirically, the approach attains state-of-the-art or competitive results across several datasets (e.g., SMMNIST, BAIR, KTH, Human3.6M, Cityscapes, KITTI), with improved long-horizon coherence and clear disentanglement between motion and appearance, supporting the value of incorporating global dynamics for dynamic scenes.
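To make the ELBO-based objective mentioned above more concrete, here is a minimal NumPy sketch of its two generic ingredients: a KL divergence between a diagonal-Gaussian posterior and prior, and a reconstruction term. The function names (`kl_diag_gaussians`, `sample_reparam`, `negative_elbo`), the unit-variance Gaussian likelihood, and the unweighted sum of terms are illustrative assumptions, not the paper's actual implementation or loss weighting.

```python
import numpy as np


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )


def sample_reparam(mu, logvar, rng):
    """Reparameterized sample: mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def negative_elbo(x, x_hat, kl_terms):
    """Negative ELBO up to an additive constant.

    Assumes a unit-variance Gaussian likelihood (hypothetical choice), so the
    reconstruction term reduces to half the summed squared error.
    """
    recon = 0.5 * np.sum((x - x_hat) ** 2)
    kl_total = sum(float(np.sum(kl)) for kl in kl_terms)
    return recon + kl_total


# Toy usage: posterior N([1, 0], I) against a standard Gaussian prior.
mu_q = np.array([1.0, 0.0])
zeros = np.zeros(2)
kl = kl_diag_gaussians(mu_q, zeros, zeros, zeros)
z = sample_reparam(mu_q, zeros, np.random.default_rng(0))
loss = negative_elbo(np.ones(4), np.ones(4), [kl])
```

In a model of the kind summarized here, one such KL term would plausibly appear per latent variable (for example the global trend and each local dynamic variable), with the prior and posterior parameters produced by the respective networks; that mapping is an assumption of this sketch.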

Abstract

Stochastic video prediction enables the consideration of uncertainty in future motion, thereby providing a better reflection of the dynamic nature of the environment. Stochastic video prediction methods based on image auto-regressive recurrent models need to feed their predictions back into the latent space. Conversely, state-space models, which decouple frame synthesis and temporal prediction, prove to be more efficient. However, inferring long-term temporal information about motion and generalizing to dynamic scenarios under non-stationary assumptions remain unresolved challenges. In this paper, we propose a state-space decomposition stochastic video prediction model that decomposes the overall video frame generation into deterministic appearance prediction and stochastic motion prediction. Through adaptive decomposition, the model's generalization capability to dynamic scenarios is enhanced. In the context of motion prediction, obtaining a prior on the long-term trend of future motion is crucial. Thus, in the stochastic motion prediction branch, we infer the long-term motion trend from conditional frames to guide the generation of future frames that exhibit high consistency with the conditional frames. Experimental results demonstrate that our model outperforms baselines on multiple datasets.

Paper Structure (16 sections, 9 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Generative model $p$ and inference model $q$ of our method, where circles and diamonds represent stochastic and deterministic variables, respectively. (a) In the generative model, the global motion trend variable $\bm{z}_1$ is generated from the conditional frames $\bm{x}_{1:k}$ (here $k=2$), and the local dynamic variable $\bm{z}_t$ is generated from the previous motion variable $\bm{y}_{t-1}$. (b) In the inference model, $\bm{z}_1$ is inferred from the complete sequence $\bm{x}_{1:T}$, and the local dynamic $\bm{z}_t$ is inferred from the frame sequence $\bm{x}_{1:t}$. The motion variable $\bm{y}_t$ and the appearance variable $\bm{w}_t$ are jointly decoded to generate the frame $\hat{\bm{x}}_t$.
  • Figure 2: Framework of our method. The original frames $\bm{x}_{1:T}$ are mapped to a latent space through an encoder, and an LSTM captures the temporal dynamics within this latent space in the motion prediction branch. In the appearance prediction branch, a ViT is employed to encode the static features related to the background. To encourage the motion variables to disregard static features, a standard Gaussian prior is applied to the motion variables (right). The prior and posterior of the global dynamic variable $\bm{z}_1$ are inferred from the conditional frames $\bm{x}_{1:k}$ and input frames $\bm{x}_{1:T}$, respectively (middle). The frame $\hat{\bm{x}}_t$ is jointly decoded from the appearance variable $\bm{w}_t$ and the motion variable $\bm{y}_t$ (left). The training pipeline and testing pipeline are detailed in Appendix B.
  • Figure 3: The PSNR scores over timestep $t$ for our proposed method and various baselines. Each score is the mean over five different samples generated by the models. In terms of PSNR, our proposed model achieves superior performance on the KTH, Human3.6M and Cityscapes datasets, and performance comparable to state-of-the-art models on the BAIR, SMMNIST and KITTI datasets.
  • Figure 4: Person walking. The top row shows the ground truth, followed by the predictions from SRVP, SLAMP and our method.
  • Figure 5: Overlapping digits. This figure shows two overlapping digits and the predictions from SRVP, SLAMP and our method.
  • ...and 4 more figures
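
The per-timestep PSNR curves described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration that averages PSNR over several sampled predictions at each timestep; the function name `psnr_per_timestep`, the array layout `(samples, time, height, width, channels)`, the `[0, 1]` pixel range and the small clamp on the mean squared error are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def psnr_per_timestep(samples, target, max_val=1.0):
    """Mean PSNR (in dB) over sampled predictions, one value per timestep.

    samples: array of shape (S, T, H, W, C), S sampled predictions.
    target:  array of shape (T, H, W, C), the ground-truth frames.
    """
    num_samples, num_steps = samples.shape[:2]
    sq_err = (samples - target[None]) ** 2
    # Per-sample, per-timestep mean squared error, shape (S, T).
    mse = sq_err.reshape(num_samples, num_steps, -1).mean(axis=-1)
    # Clamp to avoid division by zero for pixel-perfect predictions.
    mse = np.maximum(mse, 1e-12)
    scores = 10.0 * np.log10(max_val ** 2 / mse)
    # Average over the S samples, as in the caption's "mean over five samples".
    return scores.mean(axis=0)


# Toy usage: five constant predictions that are off by 0.1 everywhere.
target = np.zeros((3, 8, 8, 1))
samples = np.full((5, 3, 8, 8, 1), 0.1)
curve = psnr_per_timestep(samples, target)
```

With a uniform error of 0.1 on a `[0, 1]` pixel range, the mean squared error is 0.01, so every timestep in this toy example scores about 20 dB.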