CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

Jiezhi Yang; Khushi Desai; Charles Packer; Harshil Bhatia; Nicholas Rhinehart; Rowan McAllister; Joseph Gonzalez

TL;DR

CARFF addresses probabilistic 3D scene forecasting under partial observability by encoding ego-centric images into a latent 3D belief space $q_\phi(z \mid I^t_c)$ and forecasting its evolution with a mixture-density network. The approach introduces a Pose-Conditional VAE (PC-VAE) to obtain pose-invariant latents and a NeRF decoder conditioned on these latents, enabling 3D scene reconstruction from beliefs. An MDN-based forecaster handles multi-modal future states within a POMDP, supporting planning under occlusions. Across Blender and CARLA experiments, CARFF demonstrates accurate 3D novel view synthesis, multi-view reasoning under uncertainty, and planning capabilities in driving-like scenarios. This framework provides a principled method for perceiving, forecasting, and acting under uncertainty in complex 3D environments.
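As a rough illustration of what a latent belief looks like in practice, the sketch below samples scene latents from a diagonal Gaussian with the reparameterization trick and computes the usual VAE KL term. It assumes a standard normal prior and a three-dimensional latent. The function names and numbers are hypothetical and are not taken from the CARFF implementation.

```python
import numpy as np

def sample_belief(mu, log_var, n_samples, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) via the reparameterization trick."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + std * eps  # each row is one plausible latent scene

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), assuming a standard normal prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Hypothetical encoder output for a single ego-centric image.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
log_var = np.array([-2.0, -2.0, -2.0])

z = sample_belief(mu, log_var, n_samples=4, rng=rng)
print(z.shape)                             # (4, 3)
print(kl_to_standard_normal(mu, log_var))  # positive unless mu = 0 and sigma = 1
```

Each sampled row stands for one hypothesis about the scene; in the pipeline described above, such a sample would condition the decoder rather than be inspected directly.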

Abstract

We propose CARFF, a method for predicting future 3D scenes given past observations. Our method maps 2D ego-centric images to a distribution over plausible 3D latent scene configurations and predicts the evolution of hypothesized scenes through time. Our latents condition a global Neural Radiance Field (NeRF) to represent a 3D scene model, enabling explainable predictions and straightforward downstream planning. This approach models the world as a POMDP and considers complex scenarios of uncertainty in environmental states and dynamics. Specifically, we employ a two-stage training of Pose-Conditional-VAE and NeRF to learn 3D representations, and auto-regressively predict latent scene representations utilizing a mixture density network. We demonstrate the utility of our method in scenarios using the CARLA driving simulator, where CARFF enables efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving occlusions.
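The auto-regressive forecasting step can be pictured with a minimal mixture density sketch: a network maps the current latent to mixture weights, means, and log standard deviations, and the next latent is sampled from that mixture. Everything below is an assumption made for illustration. In particular, `toy_predict_mixture` is a hand-written stand-in for a learned network, and the diagonal-Gaussian parameterization is a common choice rather than a detail confirmed by the text.

```python
import numpy as np

def mdn_log_likelihood(z_next, weights, means, log_stds):
    """Log-density of z_next under a mixture of K diagonal Gaussians."""
    var = np.exp(2.0 * log_stds)                          # shape (K, D)
    log_comp = -0.5 * np.sum(
        (z_next - means) ** 2 / var + 2.0 * log_stds + np.log(2.0 * np.pi),
        axis=1,
    )                                                     # shape (K,)
    a = np.log(weights) + log_comp
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))              # stable log-sum-exp

def mdn_sample(weights, means, log_stds, rng):
    """Pick a mixture component, then sample from its Gaussian."""
    k = rng.choice(len(weights), p=weights)
    return means[k] + np.exp(log_stds[k]) * rng.standard_normal(means.shape[1])

def rollout(z0, predict_mixture, steps, rng):
    """Auto-regressively sample a latent trajectory of length steps + 1."""
    z, traj = z0, [z0]
    for _ in range(steps):
        weights, means, log_stds = predict_mixture(z)
        z = mdn_sample(weights, means, log_stds, rng)
        traj.append(z)
    return np.stack(traj)

def toy_predict_mixture(z):
    """Hypothetical stand-in for a learned MDN: the scene stays put or shifts by 1."""
    weights = np.array([0.5, 0.5])
    means = np.stack([z, z + 1.0])
    log_stds = np.full_like(means, -3.0)
    return weights, means, log_stds

traj = rollout(np.zeros(2), toy_predict_mixture, steps=3, rng=np.random.default_rng(0))
print(traj.shape)  # (4, 2)
```

Training such a model would minimize the negative of `mdn_log_likelihood` over observed latent transitions. Sampling several rollouts gives the distinct futures (for example, a car appearing or not) that a planner can compare.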

Paper Structure (54 sections, 6 equations, 13 figures, 8 tables)

Introduction
Related work
NeRF and 3D representations
Neural radiance fields
Multi-scene NeRF
Scene Forecasting
Planning in 2D space
NeRF in robotics
Method
Pose-Conditional VAE (PC-VAE) and NeRF
Architecture
Training methodology
Loss
Scene Forecasting
Formulation
...and 39 more sections

Figures (13)

Figure 1: CARFF 3D planning application for driving. An input image containing a partially observable view of an intersection is processed by CARFF's encoder to establish 3D environment state beliefs, i.e. the predicted possible state of the world: whether or not there could be another vehicle approaching the intersection. These beliefs are used to forecast the future in 3D for planning, generating one among two possible actions for the vehicle to merge into the other lane.
Figure 2: Novel view planning application. CARFF makes reasoning about views occluded from the ego car as simple as moving the camera to inspect the sampled belief predictions, which enables straightforward downstream planning using, for example, density probing or 2D segmentation models from arbitrary angles.
Figure 3: Visualizing CARFF's two-stage training process. Left: The convolutional ViT-based encoder encodes each image $I$ at timestamps $t, t'$ and camera poses $c, c'$ into Gaussian latent distributions. Assuming two timestamps and an overparameterized latent, one Gaussian distribution will have a smaller $\sigma^2$, and different $\mu$ across timestamps. Upper Right: The pose-conditional decoder stochastically decodes the sampled latent $z$ using the camera pose $c''$ into images $I_{c''}^{t}$ and $I_{c''}^{t'}$. The decoded reconstruction and ground truth images are used for the loss $\mathcal{L}_{\text{MSE, PC-VAE}}$. Lower Right: A NeRF is trained by conditioning on the latent variables sampled from the optimized Gaussian parameters. These parameters characterize the distinct timestamp distributions derived from the PC-VAE. An MSE loss is calculated for NeRF as $\mathcal{L}_{\text{MSE, NeRF}}$.
Figure 4: Multi-scene CARLA datasets. Varying car configurations and scenes for the Multi-Scene Two Lane Merge dataset (left) and the Multi-Scene Approaching Intersection dataset (right).
Figure 5: Blender dataset. Blender dataset with a blue cube and a potential red cylinder exhibiting probabilistic temporal movement. The possible occlusions from different camera angles demonstrate how movement needs to be modeled probabilistically.
...and 8 more figures
