Action-conditioned video data improves predictability
Meenakshi Sarkar, Debasish Ghose
TL;DR
The paper tackles long-horizon video prediction under partial observability when the recording camera moves, by explicitly modeling the coupling between environment dynamics and robot actions. It introduces Action-Conditioned Video Generation (ACVG), a dual Generator-Actor architecture where the Generator predicts future frames $\tilde{x}_{t+1}$ using an augmented flow input $\tilde{O}_t=[\tilde{o}_t,a_t]$, and the Actor forecasts the next action $\tilde{a}_{t+1}$ based on latent state information, forming a causal feedback loop. Training proceeds in three phases (Generator, Actor, then Dual) with a joint loss that combines reconstruction losses for $x_t$, $o_t$, and $a_t$, plus optional adversarial loss to sharpen visuals. Experiments on the RoAM dataset show that ACVG outperforms ACPNet, VANet, and ACVG-fa on frame-wise perceptual metrics (e.g., VGG16 cosine similarity, LPIPS) and overall temporal coherence (FVD), validating that modeling action dynamics improves prediction accuracy in partially observable robotics scenarios. The work has practical implications for reinforcement learning and planning in dynamic environments, where accurate anticipation of both vision and control signals is crucial.
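To make the Generator-Actor feedback loop concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The functions `toy_generator` and `toy_actor` are hypothetical stand-ins for the learned networks, the way the next flow is derived from consecutive frames is an assumption made only for illustration, and the loss weights `w_x`, `w_o`, `w_a`, and `w_adv` are illustrative names that do not come from the paper.

```python
from typing import List, Tuple

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def joint_loss(x_pred: Vector, x_true: Vector,
               o_pred: Vector, o_true: Vector,
               a_pred: Vector, a_true: Vector,
               w_x: float = 1.0, w_o: float = 1.0, w_a: float = 1.0,
               adv: float = 0.0, w_adv: float = 0.0) -> float:
    """Weighted sum of reconstruction terms for frame, flow, and action,
    plus an optional adversarial term (weights are assumptions)."""
    return (w_x * mse(x_pred, x_true)
            + w_o * mse(o_pred, o_true)
            + w_a * mse(a_pred, a_true)
            + w_adv * adv)


def toy_generator(x: Vector, o: Vector, a: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical Generator: consumes the augmented input [o_t, a_t]
    and returns a predicted next frame and next flow."""
    augmented = o + a  # list concatenation stands in for [o_t, a_t]
    shift = sum(augmented) / len(augmented)
    x_next = [v + 0.1 * shift for v in x]
    # Assumption: the flow is approximated by the frame difference.
    o_next = [xn - xo for xn, xo in zip(x_next, x)]
    return x_next, o_next


def toy_actor(x_next: Vector, a: Vector) -> Vector:
    """Hypothetical Actor: forecasts the next action from a crude latent
    summary of the predicted frame and the current action."""
    latent = sum(x_next) / len(x_next)
    return [0.9 * ai + 0.1 * latent for ai in a]


def rollout(x0: Vector, o0: Vector, a0: Vector,
            steps: int) -> Tuple[List[Vector], List[Vector]]:
    """Closed-loop prediction: each predicted action is fed back into
    the Generator at the next step."""
    frames, actions = [], []
    x, o, a = x0, o0, a0
    for _ in range(steps):
        x, o = toy_generator(x, o, a)
        a = toy_actor(x, a)
        frames.append(x)
        actions.append(a)
    return frames, actions
```

Calling `rollout([0.0, 0.0], [1.0, 1.0], [1.0], steps=3)` returns three predicted frames and three predicted actions. The point of the sketch is the data flow: the Actor's output at step $t$ becomes part of the Generator's input at step $t+1$, which is the feedback loop the summary describes.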
Abstract
Long-term video generation and prediction remain challenging tasks in computer vision, particularly in partially observable scenarios where cameras are mounted on moving platforms. The interaction between observed image frames and the motion of the recording agent introduces additional complexities. To address these issues, we introduce the Action-Conditioned Video Generation (ACVG) framework, a novel approach that investigates the relationship between actions and generated image frames through a deep dual Generator-Actor architecture. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments. We evaluate the framework's effectiveness on an indoor robot motion dataset which consists of sequences of image frames along with the sequences of actions taken by the robotic agent, conducting a comprehensive empirical study comparing ACVG to other state-of-the-art frameworks along with a detailed ablation study.
