Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation

Sudhir Yarram; Junsong Yuan

TL;DR

This work tackles video extrapolation in space and time (VEST), i.e., forecasting future video frames and rendering them from novel viewpoints, by replacing entangled, layer-based representations with a 3D scene model built from depth-estimated point clouds. It disentangles geometry from motion and splits motion forecasting into two stages: it first predicts ego-motion and then the residual motion of dynamic objects, which yields more accurate 3D motion flows $\mathbf{u}$ and photorealistic synthesis via differentiable 3D-to-2D splatting. The approach uses semantic segmentation and inpainting to handle disocclusions, and multi-scale motion flow blocks (MMFB) to capture dynamics across scales. Experiments on KITTI and Cityscapes show improvements over baselines in VEST, video prediction, and novel-view synthesis, although depth inaccuracies on thin structures remain a limitation.

Abstract

Video extrapolation in space and time (VEST) enables viewers to forecast a 3D scene into the future and view it from novel viewpoints. Recent methods propose to learn an entangled representation that models layered scene geometry, motion forecasting, and novel view synthesis together, while assuming simplified affine motion and homography-based warping at each scene layer, which leads to inaccurate video extrapolation. Instead of entangled scene representation and rendering, our approach disentangles scene geometry from scene motion by lifting the 2D scene to 3D point clouds, which enables high-quality rendering of future videos from novel views. To model future 3D scene motion, we propose a disentangled two-stage approach that first forecasts ego-motion and then the residual motion of dynamic objects (e.g., cars, people). This approach yields more precise motion predictions by reducing the inaccuracies that arise when ego-motion is entangled with dynamic object motion; better ego-motion forecasting can significantly enhance the visual outcomes. Extensive experimental analysis on two urban scene datasets demonstrates the superior performance of our proposed method in comparison to strong baselines.
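
The lifting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a dense depth map. The function names `unproject` and `project` are hypothetical and not taken from the paper, and plain Python lists stand in for the tensors a real implementation would use.

```python
# Illustrative sketch only: pinhole camera model and helper names are assumptions.

def unproject(depth, fx, fy, cx, cy):
    """Lift every pixel (u, v) with depth d to a 3D point (x, y, z) in the camera frame."""
    points = []
    for v, row in enumerate(depth):          # v indexes image rows
        for u, d in enumerate(row):          # u indexes image columns
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Toy 2x2 depth map: the top row is closer to the camera than the bottom row.
depth = [[2.0, 2.0],
         [4.0, 4.0]]
cloud = unproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Projecting a lifted point with the same intrinsics returns its original pixel location; rendering from a novel view amounts to transforming the points into the target camera frame before projecting them.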

Paper Structure (28 sections, 6 equations, 5 figures, 4 tables)

This paper contains 28 sections, 6 equations, 5 figures, 4 tables.

Introduction
Related Work
Video prediction.
Novel view synthesis.
Depth-based 3D scene representation.
Our Method
Disentangled 3D Scene Representation
Constructing 3D Point Cloud
Semantic segmentation.
Image and Depth Inpainting.
Feature Encoding.
Splatting and Rendering
Forecasting Future 3D Motion
Ego-motion forecasting (EMF).
Computing 3D motion flow.
...and 13 more sections

Figures (5)

Figure 1: Comparison of VEST approaches. Compared to VEST-MPI [zhang2022video], our method features: (1) Disentangled 3D geometry and motion representation: while VEST-MPI [zhang2022video] relies on an entangled, layered MPI representation with simplified affine motion and homography-based warping, we use depth maps to transform 2D images into 3D point clouds, disentangling scene geometry from motion for high-quality rendering from novel viewpoints; (2) Disentangled ego-motion and object motion forecasting: departing from VEST-MPI's simultaneous modeling of ego-motion and object motion, we adopt a two-stage forecasting approach that first predicts ego-motion and then addresses residual object motion. This separation allows our model to forecast 3D motion more accurately.
Figure 2: Method overview. Our framework aims to forecast a 3D scene into the future and view it from novel viewpoints. It comprises three primary steps. (1) Constructing 3D point clouds: starting with two past frames as input, we construct per-frame 3D point clouds. (i) The process for each frame involves depth estimation, disocclusion handling via inpainting, and feature extraction, which together produce what we refer to as a feature layer. (ii) The point-wise features in this feature layer are then lifted into 3D space using the corresponding depth values, generating 3D point clouds. This process is performed on both $\mathbf{I}_{(t-1)}$ and $\mathbf{I}_{(t)}$ to obtain feature layers $\mathcal{F}_{(t-1)}$ and $\mathcal{F}_{(t)}$ and point clouds $\mathcal{P}_{(t-1)}$ and $\mathcal{P}_{(t)}$. (2) Forecasting future 3D motion: we use the feature layers $\mathcal{F}_{(t-1)}$ and $\mathcal{F}_{(t)}$ to forecast future 3D motion for each of the point clouds. The forecasted 3D motion lets us move the points of $\mathcal{P}_{(t-1)}$ and $\mathcal{P}_{(t)}$ to their forecasted locations. (3) Splatting and rendering: a point-based renderer processes these motion-adjusted point clouds through 3D-to-2D splatting to generate feature maps. Finally, a refinement network decodes the rendered feature maps to synthesize a novel view $\hat{I}'_{(t+1)}$ at the target viewpoint.
Figure 3: (a) Constructing 3D point cloud. (1) Estimate the depth map $\mathbf{D}$ from the input image $\mathbf{I}$. (2) Address "holes" in future frames caused by disocclusions from dynamic object motion: (i) segment dynamic-category (foreground) objects to produce a binary mask $\mathbf{M}$, identifying potential regions for "holes"; (ii) mask these foreground regions in both the input image and the depth map, then inpaint them using the background context. (3) Extract features from both the original and inpainted frames to produce $\mathbf{F}$ and $\mathbf{F}^{\overline{\text{BG}}}$. (4) Create the 3D point cloud $\mathcal{P}$ by unprojecting the 2D features $\mathbf{F}$ and $\mathbf{F}^{\overline{\text{BG}}}$ into 3D, using the depth maps $\mathbf{D}$ and $\mathbf{D}^{\overline{\text{BG}}}$, respectively. For simplicity, we refer to the set $\{\mathbf{F}, \mathbf{D}, \mathbf{M}\}$ as the original feature layer, denoted by $\mathcal{F}$, and to the set $\{\mathbf{F}^{\overline{\text{BG}}}, \mathbf{D}^{\overline{\text{BG}}}, \mathbf{M}\}$ as the inpainted feature layer, denoted by $\mathcal{F}^{\overline{\text{BG}}}$. (b) Forecasting future 3D motion. Given feature layers from past frames, our method forecasts future 3D motion flow in two stages. (1) The EMF module forecasts ego-motion by processing the background (static category) across frames using the inpainted feature layers $\mathcal{F}_{(t-1)}^{\overline{\text{BG}}}$ and $\mathcal{F}_{(t)}^{\overline{\text{BG}}}$, yielding two relative ego-pose transformations, $\mathcal{T}_{(t-1) \rightarrow (t+1)}$ and $\mathcal{T}_{(t) \rightarrow (t+1)}$. These transformations lead to the initial 3D motion flows $\mathbf{u}^{0}_{(t-1) \rightarrow (t+1)}$ and $\mathbf{u}^{0}_{(t) \rightarrow (t+1)}$, jointly referred to as $\mathbf{U}^{0}_{(t+1)}$. (2) The OMF module then refines the initial 3D motion flow $\mathbf{U}^{0}_{(t+1)}$ by accounting for foreground object motion, using the original and inpainted feature layers to derive the final forecasted 3D motion flow, $\mathbf{U}^{L}_{(t+1)}$, after $L$ MMFB blocks. (c) Multi-scale motion flow block (MMFB). We illustrate the design of an MMFB block here.
Figure 4: Qualitative comparison with VEST-MPI [zhang2022video] on the video prediction task (VEST-[S,T]). The results show that our method produces sharper frames with high-quality motion forecasts, particularly over the long term.
Figure 5: Qualitative results for concurrent video extrapolation in space and time (VEST-[S+T]). (a) The DMVFN [hu2023dynamic] $\rightarrow$ 3D Photo [shih20203d] baseline produces stretching artifacts around the car due to disocclusions caused by the use of 2D flow-based backward warping. (b) The WALDO [le2023waldo] $\rightarrow$ 3D Photo [shih20203d] baseline shows inconsistent motion of the road. This occurs because of the layered approach of WALDO [le2023waldo], where the road and the car are mistakenly assigned to the same layer and therefore receive similar motion. (c) Our approach mitigates these issues, achieving high-fidelity motion forecasting.
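
The two-stage forecast described in Figure 3(b) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the ego-pose is taken to be a rigid transform (rotation `R`, translation `t`) that maps points into the future camera frame, and the object-motion refinement is modeled as an additive residual applied only to points flagged as dynamic. All function names are hypothetical, and the learned EMF and OMF networks are replaced by hand-set values.

```python
# Illustrative sketch only: rigid ego-pose and additive masked residual are assumptions.

def apply_rigid(R, t, p):
    """Apply a rigid transform (3x3 rotation R, translation t) to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def ego_flow(points, R, t):
    """Stage 1: initial 3D flow induced by ego-motion alone, u0 = (R p + t) - p."""
    flows = []
    for p in points:
        q = apply_rigid(R, t, p)
        flows.append(tuple(q[i] - p[i] for i in range(3)))
    return flows


def add_residual(flows, residuals, mask):
    """Stage 2: add a residual flow, but only for points marked as dynamic."""
    return [
        tuple(f[i] + (r[i] if m else 0.0) for i in range(3))
        for f, r, m in zip(flows, residuals, mask)
    ]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 5.0), (1.0, 0.0, 10.0)]   # first point: a car, second: a building
mask = [True, False]                           # dynamic-object mask per point

# The camera moves 1 unit forward, so static points approach by 1 unit in z.
u0 = ego_flow(points, identity, (0.0, 0.0, -1.0))
# Hand-set residual standing in for a predicted object motion (0.5 units sideways).
u_final = add_residual(u0, [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)], mask)
```

In this toy case the static point keeps the ego-motion flow unchanged, while the dynamic point receives the additional sideways displacement, which is the separation that the two-stage design is meant to provide.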
