Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation
Sudhir Yarram, Junsong Yuan
TL;DR
This work tackles video extrapolation in space and time (VEST), i.e., forecasting future video frames from novel viewpoints, by replacing entangled, layer-based representations with a continuous 3D scene model built from depth-estimated point clouds. It disentangles geometry from motion and splits motion forecasting into two stages: first predicting ego-motion, then predicting the residual motion of dynamic objects. This yields more accurate 3D motion flows $\mathbf{u}$ and supports photorealistic synthesis via differentiable 3D-to-2D splatting. The approach uses semantic segmentation and inpainting to handle disocclusions, and multi-scale motion flow blocks (MMFB) to capture dynamics across scales. Experiments on KITTI and Cityscapes show improvements over baselines in VEST, video prediction, and novel-view synthesis, though depth inaccuracies on thin structures remain a limitation. Overall, the method provides an end-to-end, geometry-aware framework for forecasting future video from new viewpoints.
Abstract
Video extrapolation in space and time (VEST) enables viewers to forecast a 3D scene into the future and view it from novel viewpoints. Recent methods learn an entangled representation that models layered scene geometry, motion forecasting, and novel view synthesis together, while assuming simplified affine motion and homography-based warping at each scene layer; these assumptions lead to inaccurate video extrapolation. Instead of entangled scene representation and rendering, our approach disentangles scene geometry from scene motion by lifting the 2D scene to 3D point clouds, which enables high-quality rendering of future videos from novel views. To model future 3D scene motion, we propose a disentangled two-stage approach that first forecasts ego-motion and then the residual motion of dynamic objects (e.g., cars, people). This yields more precise motion predictions by reducing the errors that arise when ego-motion is entangled with dynamic object motion; better ego-motion forecasting in turn can significantly improve visual quality. Extensive experimental analysis on two urban scene datasets demonstrates the superior performance of our proposed method compared with strong baselines.
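To make the geometry behind the two-stage idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a per-pixel depth map, a 4x4 rigid transform `ego_T` mapping current-camera coordinates to future-camera coordinates, and a per-point `residual_flow` for dynamic objects. The function names and this sign convention are hypothetical. In the paper, ego-motion and residual flow are forecast by learned networks and rendering uses differentiable splatting; here they are plain inputs, and plain projection stands in for rendering.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v: row index, u: column index
    # Homogeneous pixel coordinates, shape (3, N), in row-major pixel order.
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0).astype(float)
    rays = np.linalg.inv(K) @ pix          # back-projected rays with z = 1
    return (rays * depth.ravel()).T        # scale each ray by its depth

def forecast_points(points, ego_T, residual_flow):
    """Stage 1: apply rigid ego-motion. Stage 2: add per-point residual motion."""
    R, t = ego_T[:3, :3], ego_T[:3, 3]
    moved = points @ R.T + t               # assumed convention: current -> future camera
    return moved + residual_flow           # residual is zero for static points

def project(points, K):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]
```

With an identity `ego_T` and zero `residual_flow`, projecting the forecast points reproduces the original pixel grid. A novel viewpoint can be handled the same way by composing an extra rigid transform before `project`.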
