Video Generation with Learned Action Prior

Meenakshi Sarkar; Devansh Bhardwaj; Debasish Ghose

TL;DR

Three models are introduced: VG-LeAP, which treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes.

Abstract

Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution that treats camera motion, or action, as part of the observed image state and models both image and action within a multi-modal learning framework. We introduce three models: Video Generation with Learned Action Prior (VG-LeAP), which treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes, demonstrating that this action-conditioned image generation concept can be extended to other diffusion-based models. We emphasize the importance of multi-modal training in partially observable video generation problems through detailed empirical studies on our new video action dataset, RoAM.
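As an illustration of what learning a latent prior with variational inference involves, the sketch below computes the KL divergence between a posterior and a learned prior. It assumes both are diagonal Gaussians parameterised by a mean and a log-variance, which is common in stochastic video prediction but is not stated in the abstract; the function name `gaussian_kl` and its arguments are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def gaussian_kl(
    mu_q: Sequence[float],
    logvar_q: Sequence[float],
    mu_p: Sequence[float],
    logvar_p: Sequence[float],
) -> float:
    """KL(q || p) between two diagonal Gaussians.

    Assumption: q is the posterior and p the learned prior over the
    latent variable, each given by per-dimension mean and log-variance.
    """
    if not (len(mu_q) == len(logvar_q) == len(mu_p) == len(logvar_p)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL for one dimension, summed over dimensions.
        total += 0.5 * (
            lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
        )
    return total


# Identical distributions give zero divergence.
kl_same = gaussian_kl([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
# Shifting one posterior mean by 1 (unit variances) adds 0.5 to the divergence.
kl_shifted = gaussian_kl([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

In a model of this kind, a term like this would typically be added to a reconstruction loss so that the prior, which is the only latent model available at inference time, stays close to the posterior.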

Paper Structure (14 sections, 34 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Prior works
Action conditioned video generation
Video generation with learnt action prior
Causal video generation with learned action prior
Random Action-Frame Conditioned Flow Integrating video generation model
Dataset and Experiments
RoAM dataset
Experimental Setup
Results and Discussion
Conclusion
Appendix
Variational Lower Bound for Video Generation with Learned Action Prior
Variational Lower Bound for Causal Video Generation with Learned Action Prior

Figures (4)

Figure 1: The state-flow panel shows the state flow diagram and generation model for the VG-LeAP model, whose learned image-action prior $z_t$ depends on the image-action pair $(x_t, a_t)$. The block-diagram panel depicts the architecture of the video generation model with learned action prior (red dotted box) together with the posterior network (green dotted box). At inference time only the prior model (red) is used. The prior and posterior latent models are trained using a KL divergence loss.
Figure 2: The state-flow panel shows the state flow diagram and generation model for the Causal-LeAP model, whose learned action prior $u_t$ depends on the learned image prior $z_t$. The forward causal relationship between the image latent state $z_t$ and the action latent variable $u_t$ is depicted by the continuous blue connecting line. The dotted lines from $a_{t-1}$ to $x_t$ represent the dependency between past actions and future observed images. The block-diagram panel depicts the architecture of the video generation model with both the learned action prior and the learned image prior (red dotted boxes). The posterior networks are shown in green dotted boxes. At inference time only the prior models (red) are used in the forward pass to sample $z_t$ and $u_t$ and generate $\tilde{x}_t$ and $\tilde{a}_t$. The prior and posterior latent models are trained using a KL divergence loss.
Figure 3: The LPIPS (lower is better), VGG similarity (higher is better) and PSNR (higher is better) panels show the quantitative performance of Causal-LeAP, VG-LeAP, SVG (SVG-lp), RAFI, SRVP, and ACPNet over 20 different samples when predicting 20 future image frames from 5 past conditioning frames. In all the quantitative performance metrics, the Causal-LeAP model outperforms the other five. In the LPIPS values for RAFI and ACPNet, both models start much better than Causal-LeAP; however, as time passes, both perform much worse than the LeAP models. The reason for this performance degradation is completely different for the two models, as explained in the Results and Discussion section.
Figure 4: The three panels show the L$_2$ norm error between the predicted action values and the ground truth for Causal-LeAP, VG-LeAP and RAFI. The first panel shows the error in the normalised forward velocity for all three models: although they initially perform similarly, as time increases VG-LeAP produces much noisier and more erroneous predictions than the other two. The second panel compares only Causal-LeAP and RAFI on forward velocity and shows that RAFI performs much better over time. The third panel shows that for angular rotation, or turn rate, Causal-LeAP provides the best predictions and RAFI performs the worst.
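Two of the quantities plotted in Figures 3 and 4 have simple closed forms and can be sketched as below. This is a minimal illustration, not the paper's evaluation code: it assumes frames are flattened lists of pixel values in $[0, 1]$ and actions are short lists of normalised values, and the names `psnr` and `l2_error` are hypothetical.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened frames.

    Assumption: pixel values lie in [0, max_value].
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("frames must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def l2_error(pred_action: Sequence[float], true_action: Sequence[float]) -> float:
    """Euclidean (L2) norm of the difference between two action vectors."""
    if len(pred_action) != len(true_action):
        raise ValueError("actions must be the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_action, true_action)))


# A uniform pixel error of 0.1 gives an MSE of 0.01, i.e. about 20 dB.
frame_score = psnr([0.0, 0.0], [0.1, 0.1])
# A 3-4-5 triangle: the action error norm is 5.
action_error = l2_error([3.0, 0.0], [0.0, 4.0])
```

Evaluating such functions per predicted time step and averaging over sampled rollouts would give curves of the kind the captions describe. LPIPS and VGG similarity need pretrained networks and are not sketched here.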
