Flow and Depth Assisted Video Prediction with Latent Transformer

Eliyas Suleyman, Paul Henderson, Eksan Firkat, Nicolas Pugeault

TL;DR

We address occlusion in near-future video prediction by conditioning on explicit motion and geometry cues. Our method extends the latent transformer SCAT with point-flow from Cotracker and depth maps from DepthAnything-V2, which are concatenated with the RGB frames. Training follows a two-stage pipeline (an OAAE encoder, then a transformer predictor). We compare four variants (SCAT, SCAT-P, SCAT-D, SCAT-PD) on Kubric-Occlusion and KITTI using appearance metrics (PSNR, SSIM, LPIPS) and motion metrics (OFD, EMD). Adding point-flow and depth improves occlusion handling and background motion estimation. This modality-augmented, object-centric approach achieves competitive motion accuracy with smaller models and offers a practical path toward robust world modeling in robotics.
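The channel-wise concatenation mentioned above can be illustrated with a minimal sketch. The tensor layout `(T, H, W, C)`, the 2-channel `(dx, dy)` encoding of point-flow, the single-channel depth map, and the function name `build_input` are all assumptions made for illustration; the mapping of the suffixes P and D to point-flow and depth is inferred from the variant names, not confirmed here.

```python
import numpy as np

# Assumed mapping from variant name to the modalities it consumes.
# "P" = point-flow and "D" = depth is an inference from the variant names.
VARIANT_MODALITIES = {
    "SCAT": ("rgb",),
    "SCAT-P": ("rgb", "flow"),
    "SCAT-D": ("rgb", "depth"),
    "SCAT-PD": ("rgb", "flow", "depth"),
}


def build_input(variant, rgb, flow=None, depth=None):
    """Concatenate the modalities used by `variant` along the channel axis.

    Hypothetical shapes: rgb (T, H, W, 3), flow (T, H, W, 2), depth (T, H, W, 1).
    """
    available = {"rgb": rgb, "flow": flow, "depth": depth}
    parts = []
    for name in VARIANT_MODALITIES[variant]:
        arr = available[name]
        if arr is None:
            raise ValueError(f"variant {variant!r} needs the {name!r} modality")
        parts.append(np.asarray(arr, dtype=np.float32))
    # All parts share (T, H, W); only the last (channel) axis grows.
    return np.concatenate(parts, axis=-1)


if __name__ == "__main__":
    rgb = np.zeros((4, 8, 8, 3))
    flow = np.zeros((4, 8, 8, 2))
    depth = np.zeros((4, 8, 8, 1))
    for variant in VARIANT_MODALITIES:
        print(variant, build_input(variant, rgb, flow, depth).shape)
```

Under these assumptions the only difference between variants is the number of input channels, so the rest of the pipeline can stay unchanged across them.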

Abstract

Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion remains an inherent challenge. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, modified to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets, using not only appearance-based metrics but also Wasserstein distances on object masks, which effectively capture the motion distribution of the prediction. We find that a prediction model assisted with point-flow and depth performs better in occluded scenarios and predicts more accurate background motion than models without these modalities.
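To make the idea of a Wasserstein distance on object masks concrete, here is a minimal sketch. It is not the paper's EMD metric, whose exact formulation is not given in this section; as a simplifying assumption it sums one-dimensional Wasserstein-1 distances over the row and column marginals of two binary masks. The function names are hypothetical.

```python
import numpy as np


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on the same unit-spaced grid.

    For 1D distributions, W1 equals the summed absolute difference of the CDFs.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    if p.sum() <= 0 or q.sum() <= 0:
        raise ValueError("both histograms need positive mass")
    p = p / p.sum()  # normalise to probability distributions
    q = q / q.sum()
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def mask_marginal_emd(mask_pred, mask_true):
    """Simplified mask distance: W1 over row marginals plus W1 over column marginals.

    This is an illustrative stand-in, not the metric defined in the paper.
    """
    mask_pred = np.asarray(mask_pred, dtype=np.float64)
    mask_true = np.asarray(mask_true, dtype=np.float64)
    rows = wasserstein_1d(mask_pred.sum(axis=1), mask_true.sum(axis=1))
    cols = wasserstein_1d(mask_pred.sum(axis=0), mask_true.sum(axis=0))
    return rows + cols


if __name__ == "__main__":
    true_mask = np.zeros((6, 8))
    true_mask[0:2, 0:2] = 1  # 2x2 object in the top-left corner
    pred_mask = np.zeros((6, 8))
    pred_mask[0:2, 3:5] = 1  # same object, predicted 3 pixels to the right
    print(mask_marginal_emd(pred_mask, true_mask))
```

Unlike a per-pixel overlap score, a distance of this kind grows with how far the predicted object is displaced, which is why it is suited to measuring motion errors.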

Paper Structure

This paper contains 20 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of the proposed method. First, we obtain the additional modalities using Cotracker and DepthAnythingV2. We then use SAM2 to segment the original RGB frame sequence into objects; the segmentation map from SAM2 is also used to decompose the point-flow and depth maps. After preprocessing, we first train the OAAE to map the frames into a latent space, then train SCAT to predict the future latent frames. Finally, the predicted latent frames are reconstructed by the trained OAAE. The lower-right box shows how we train an object mask predictor on the trained OAAE's latent space; once trained, the mask predictor is used solely for evaluating EMD.
  • Figure 2: Comparison of different model variants on the Kubric-Occlusion (left) and KITTI (right) datasets.
  • Figure 3: Qualitative results of the autoencoder's reconstruction on the KITTI dataset
  • Figure 4: Qualitative results of the autoencoder's reconstruction on the Kubric-Occlusion dataset
  • Figure 5: Video prediction example on Kubric-Occlusion (1)
  • ...and 3 more figures
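
The Figure 1 caption describes using a segmentation map to decompose the point-flow and depth maps into objects. A minimal sketch of that decomposition step follows, assuming the segmentation is available as an integer id map of shape `(H, W)` and each modality as an `(H, W, C)` array; the function name and the zero-fill outside each object are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def decompose_by_segments(modality, seg_ids, object_ids):
    """Split an (H, W, C) map into per-object layers using an (H, W) id map.

    Returns an array of shape (K, H, W, C), one layer per entry in `object_ids`,
    with pixels outside the object set to zero (an assumed convention).
    """
    modality = np.asarray(modality)
    seg_ids = np.asarray(seg_ids)
    if modality.shape[:2] != seg_ids.shape:
        raise ValueError("modality and segmentation map must share (H, W)")
    layers = []
    for obj in object_ids:
        mask = (seg_ids == obj)[..., None]  # (H, W, 1), broadcasts over channels
        layers.append(np.where(mask, modality, 0))
    return np.stack(layers, axis=0)


if __name__ == "__main__":
    depth = np.array([[1.0, 2.0], [3.0, 4.0]])[..., None]  # (2, 2, 1)
    seg = np.array([[0, 1], [1, 0]])                        # two object ids
    layers = decompose_by_segments(depth, seg, object_ids=[0, 1])
    print(layers.shape)
```

When the object ids partition the image, summing the layers recovers the original map, which gives a quick sanity check on the decomposition.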