Action-conditioned video data improves predictability

Meenakshi Sarkar; Debasish Ghose

TL;DR

The paper tackles long-horizon video prediction under partial observability when the recording camera moves, by explicitly modeling the coupling between environment dynamics and robot actions. It introduces Action-Conditioned Video Generation (ACVG), a dual Generator-Actor architecture where the Generator predicts future frames $\tilde{x}_{t+1}$ using an augmented flow input $\tilde{O}_t=[\tilde{o}_t,a_t]$, and the Actor forecasts the next action $\tilde{a}_{t+1}$ based on latent state information, forming a causal feedback loop. Training proceeds in three phases (Generator, Actor, then Dual) with a joint loss that combines reconstruction losses for $x_t$, $o_t$, and $a_t$, plus optional adversarial loss to sharpen visuals. Experiments on the RoAM dataset show that ACVG outperforms ACPNet, VANet, and ACVG-fa on frame-wise perceptual metrics (e.g., VGG16 cosine similarity, LPIPS) and overall temporal coherence (FVD), validating that modeling action dynamics improves prediction accuracy in partially observable robotics scenarios. The work has practical implications for reinforcement learning and planning in dynamic environments, where accurate anticipation of both vision and control signals is crucial.
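The causal feedback loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: `rollout`, `toy_generator`, and `toy_actor` are hypothetical names, frames and actions are plain lists of floats, and the toy networks are arbitrary arithmetic stand-ins. The only thing the sketch is meant to show is the ordering of the loop: the Generator consumes the current frame and action, and the Actor then predicts the next action from the resulting state.

```python
def rollout(generator, actor, x_t, a_t, horizon):
    """Alternate Generator and Actor calls for `horizon` steps.

    generator(x_t, a_t) -> (x_next, state)   # hypothetical interface
    actor(state)        -> a_next            # hypothetical interface
    """
    frames, actions = [], []
    for _ in range(horizon):
        # Generator: predict the next frame from the current frame and action.
        x_next, state = generator(x_t, a_t)
        # Actor: predict the next action from the state produced by the Generator.
        a_next = actor(state)
        frames.append(x_next)
        actions.append(a_next)
        # Feed both predictions back in for the next step.
        x_t, a_t = x_next, a_next
    return frames, actions


def toy_generator(x, a):
    # Stand-in dynamics (assumption): shift every pixel by the first action value.
    x_next = [p + a[0] for p in x]
    state = sum(x_next) / len(x_next)  # stand-in for a latent state
    return x_next, state


def toy_actor(state):
    # Stand-in policy (assumption): action proportional to the state.
    return [0.5 * state]


frames, actions = rollout(toy_generator, toy_actor, [0.0, 2.0], [1.0], horizon=2)
```

In a real model the two callables would be neural networks and the state would be a learned latent representation; the loop structure is the only part taken from the summary.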

Abstract

Long-term video generation and prediction remain challenging tasks in computer vision, particularly in partially observable scenarios where cameras are mounted on moving platforms. The interaction between observed image frames and the motion of the recording agent introduces additional complexities. To address these issues, we introduce the Action-Conditioned Video Generation (ACVG) framework, a novel approach that investigates the relationship between actions and generated image frames through a deep dual Generator-Actor architecture. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments. We evaluate the framework's effectiveness on an indoor robot motion dataset which consists of sequences of image frames along with the sequences of actions taken by the robotic agent, conducting a comprehensive empirical study comparing ACVG to other state-of-the-art frameworks along with a detailed ablation study.

Paper Structure (8 sections, 21 equations, 5 figures, 1 table)

Introduction
Action Conditioned Video Generation
Generator Network
Actor Network
Loss and Training Loop
RoAM dataset and Experimental Setup
Results and Discussion
Conclusion

Figures (5)

Figure 1: The interdependent training loop of the Generator and Actor network of ACVG.
Figure 2: The architecture of ACVG consists of dual networks: Generator and Actor. Fig. \ref{fig:acvg_generator} shows the generator network alone during the generator training phase. During this phase, $\beta=1$ and $\gamma=0$ in the training loss in Eq. \ref{eq:loss_ln_expand}, and $a_t$ is constant. Fig. \ref{fig:acvg_actor} shows the dual configuration of the Generator-Actor framework during the actor training phase. During this phase, the weights of the pre-trained generator network are kept frozen and the generator is used in inference mode, with $\beta=0$ and $\gamma=1$ in Eq. \ref{eq:loss_ln_expand}. Finally, Fig. \ref{fig:acvg_dual} shows the configuration of the network in dual training mode, when $\beta=1$ and $\gamma=1$. In the dual training phase, a delayed actor network provides the approximate current action $\tilde{a}_t$ to the generator network. This delay is necessary because the model is causal: the actor network generates the predicted action $\tilde{a}_{t+1}$ based on the observation of state $\tilde{\chi}_{t+1}$.
Figure 3: A frame-wise quantitative analysis of ACVG, ACVG-fa, ACPNet, and VANet on the RoAM dataset for predicting 20 frames into the future from a history of 5 past frames. Starting from the left, the panels plot the mean VGG16 cosine similarity (higher is better), LPIPS score (lower is better), and PSNR (higher is better) on the test set. Fig. \ref{fig:action1_error} shows the $L_2$ error between the normalised forward velocity action predicted by the Actor network of ACVG and the ground truth.
Figure 4: A frame-wise ablation study of ACVG, ACVG-fa, ACPNet, and VANet on the RoAM dataset. Fig. \ref{fig:roam_acvg_half_vgg16} and Fig. \ref{fig:roam_acvg_half_lpips} show the VGG16 cosine similarity (higher is better) and LPIPS score (lower is better), respectively, for predicting 15 frames into the future from the past 5 frames at $0.5\,\text{fps}_{\text{train}}$, or $\Delta t_{\text{test}}=2\times\Delta t_{\text{train}}$. Fig. \ref{fig:roam_acvg_dis_vgg16} and Fig. \ref{fig:roam_acvg_dis_lpips} plot the mean VGG16 cosine similarity and LPIPS score with 95% confidence for predicting 20 frames from the past 5 frames in the presence of a random perturbation $\mathcal{N}(0,0.2)$ in the action value.
Figure 5: Predicted raw image frames, along with the corresponding forward velocity values, for ACVG, ACVG-fa, ACPNet, and VANet on the RoAM dataset for qualitative performance analysis. Each model predicts 20 future frames from the past 5 frames.
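
The Figure 2 caption gives the values of $\beta$ and $\gamma$ in each of the three training phases, but the loss equation itself is not reproduced here. The sketch below therefore assumes a simple form in which $\beta$ scales the generator-side reconstruction terms (frame and flow) and $\gamma$ scales the action reconstruction term; that weighting, and the names `PHASE_WEIGHTS` and `joint_loss`, are illustrative assumptions rather than the paper's definition.

```python
# (beta, gamma) per training phase, as stated in the Figure 2 caption.
PHASE_WEIGHTS = {
    "generator": (1.0, 0.0),  # train the Generator alone
    "actor": (0.0, 1.0),      # Generator frozen, train the Actor
    "dual": (1.0, 1.0),       # train both networks jointly
}


def joint_loss(phase, frame_loss, flow_loss, action_loss):
    """Combine per-term losses for a given phase.

    Assumed form (not taken from the paper):
        beta * (frame_loss + flow_loss) + gamma * action_loss
    """
    beta, gamma = PHASE_WEIGHTS[phase]
    return beta * (frame_loss + flow_loss) + gamma * action_loss


# Example: the same per-term losses contribute differently in each phase.
for phase in ("generator", "actor", "dual"):
    print(phase, joint_loss(phase, frame_loss=0.5, flow_loss=0.25, action_loss=2.0))
```

Setting a weight to zero removes that term from the objective for the phase, which matches the caption's description of training the Generator first, then the Actor against a frozen Generator, then both together.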