CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting
Jiezhi Yang, Khushi Desai, Charles Packer, Harshil Bhatia, Nicholas Rhinehart, Rowan McAllister, Joseph Gonzalez
TL;DR
CARFF addresses probabilistic 3D scene forecasting under partial observability by encoding ego-centric images into a latent 3D belief space $q_\phi(z \mid I^t_c)$ and forecasting its evolution with a mixture-density network. The approach introduces a Pose-Conditional VAE (PC-VAE) to obtain pose-invariant latents and a NeRF decoder conditioned on these latents, enabling 3D scene reconstruction from beliefs. An MDN-based forecaster handles multi-modal future states within a POMDP, supporting planning under occlusions. Across Blender and CARLA experiments, CARFF demonstrates accurate 3D novel view synthesis, multi-view reasoning under uncertainty, and planning capabilities in driving-like scenarios. This framework provides a principled method for perceiving, forecasting, and acting under uncertainty in complex 3D environments.
Abstract
We propose CARFF, a method for predicting future 3D scenes given past observations. Our method maps 2D ego-centric images to a distribution over plausible 3D latent scene configurations and predicts the evolution of hypothesized scenes through time. Our latents condition a global Neural Radiance Field (NeRF) to represent a 3D scene model, enabling explainable predictions and straightforward downstream planning. This approach models the world as a POMDP and handles complex scenarios with uncertainty in both environmental states and dynamics. Specifically, we employ two-stage training of a Pose-Conditional VAE (PC-VAE) and a NeRF to learn 3D representations, and auto-regressively predict latent scene representations using a mixture density network. We demonstrate the utility of our method in scenarios using the CARLA driving simulator, where CARFF enables efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving occlusions.
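To make the mixture density step concrete, the following is a minimal NumPy sketch of how a multi-modal belief over the next latent could be sampled and scored. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_mdn` and `mdn_nll`, the diagonal Gaussian components, and the toy two-mode parameters are all hypothetical, and in practice the weights, means, and standard deviations would be produced by a learned network from the current latent.

```python
import numpy as np

def sample_mdn(weights, means, stds, rng):
    """Draw one latent from a diagonal Gaussian mixture (hypothetical helper).

    weights: (K,)   mixture weights summing to 1
    means:   (K, D) component means
    stds:    (K, D) component standard deviations
    Returns the sampled latent and the index of the chosen component.
    """
    k = rng.choice(len(weights), p=weights)  # pick one mode of the future
    z = means[k] + stds[k] * rng.standard_normal(means.shape[1])
    return z, k

def mdn_nll(z, weights, means, stds):
    """Negative log-likelihood of latent z under the mixture (hypothetical helper)."""
    # Log density of z under each diagonal Gaussian component.
    log_comp = -0.5 * np.sum(
        ((z - means) / stds) ** 2 + 2.0 * np.log(stds) + np.log(2.0 * np.pi),
        axis=1,
    )
    a = np.log(weights) + log_comp
    m = a.max()  # log-sum-exp for numerical stability
    return -(m + np.log(np.sum(np.exp(a - m))))

# Toy example: two equally likely futures in a 2-D latent space
# (e.g. an occluded vehicle is either present or absent).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
stds = np.full((2, 2), 0.1)

rng = np.random.default_rng(0)
z_next, mode = sample_mdn(weights, means, stds, rng)
loss = mdn_nll(z_next, weights, means, stds)
```

Sampling a component first and then a latent from it keeps each hypothesized future distinct, whereas averaging the component means would give a latent lying between the modes that corresponds to neither scene. An auto-regressive rollout would repeat this step, feeding each sampled latent back into the forecaster.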
