Active Perception using Neural Radiance Fields

Siming He, Christopher D. Hsu, Dexter Ong, Yifei Simon Shao, Pratik Chaudhari

TL;DR

The paper formulates active perception as maximizing the predictive information $I(y_{\text{future}}, y_{\text{past}})$ to guide sensing and motion in indoor environments. It proposes a semantic NeRF representation that can synthesize multi-modal observations and enables a sampling-based planner to evaluate information gain along dynamically feasible quadrotor trajectories. Through Habitat-based simulations, the approach demonstrates improved object localization and scene reconstruction quality, driven by an integrated loop of perception, generative modeling, and planning. The work highlights the practical potential of information-theoretic objectives combined with rich scene representations for autonomous exploration tasks.
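
For readers unfamiliar with the objective, the predictive information is a mutual information, so it can be written with the standard textbook identities below. This is a reminder of the general definition, not a reproduction of the paper's own equations; $H$ denotes (differential) entropy and $p$ the joint and marginal densities of the observations.

$$
I(y_{\text{future}}, y_{\text{past}})
= H(y_{\text{future}}) - H(y_{\text{future}} \mid y_{\text{past}})
= \mathbb{E}\left[ \log \frac{p(y_{\text{future}}, y_{\text{past}})}{p(y_{\text{future}})\, p(y_{\text{past}})} \right]
$$

Read this way, a trajectory is informative when the future observations it produces are uncertain a priori but become predictable once the past observations are taken into account.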

Abstract

We study active perception from first principles to argue that an autonomous agent performing active perception should maximize the mutual information that past observations possess about future ones. Doing so requires (a) a representation of the scene that summarizes past observations and the ability to update this representation to incorporate new observations (state estimation and mapping), (b) the ability to synthesize new observations of the scene (a generative model), and (c) the ability to select control trajectories that maximize predictive information (planning). This motivates a neural radiance field (NeRF)-like representation which captures photometric, geometric and semantic properties of the scene. This representation is well-suited to synthesizing new observations from different viewpoints. A sampling-based planner can therefore be used to calculate the predictive information from synthetic observations along dynamically feasible trajectories. We use active perception for exploring cluttered indoor environments and employ a notion of semantic uncertainty to check for the successful completion of an exploration task. We demonstrate these ideas via simulation in realistic 3D indoor environments.
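
The planning step (c) can be illustrated with a minimal sketch: score each candidate trajectory with an information proxy and execute the best one. Everything here is an assumption made for illustration. The names `score_trajectory`, `select_trajectory`, and `toy_model` are hypothetical, the proxy is a plain sum of categorical entropies rather than the paper's predictive information, and dynamic feasibility of the candidates is not modeled.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def score_trajectory(trajectory, predict_categories):
    """Sum the entropy of the predicted categories over all viewpoints.

    Assumption: this entropy sum is only a stand-in for an information
    objective; it is not the quantity computed in the paper.
    """
    return sum(entropy(predict_categories(view)) for view in trajectory)


def select_trajectory(candidates, predict_categories):
    """Return the candidate trajectory with the highest proxy score."""
    return max(candidates, key=lambda traj: score_trajectory(traj, predict_categories))


def toy_model(view):
    """Hypothetical scene model: confident where x < 0, uniform elsewhere."""
    x, _ = view
    if x < 0:
        return [0.98, 0.01, 0.01]  # region assumed to be already explored
    return [1 / 3, 1 / 3, 1 / 3]   # region assumed to be unexplored


if __name__ == "__main__":
    candidates = [
        [(-2, 0), (-1, 0)],  # stays in the explored region
        [(1, 0), (2, 0)],    # heads into the unexplored region
    ]
    print(select_trajectory(candidates, toy_model))
```

In this toy setup the planner prefers the trajectory through the region where the model is least certain, which is the qualitative behavior an information-driven explorer is meant to show.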

Paper Structure (15 sections, 17 equations, 7 figures)

This paper contains 15 sections, 17 equations, 7 figures.

Figures (7)

  • Figure 1: Top: Trajectories of a quadrotor that actively explores a complex and cluttered indoor environment to localize all the different kinds of objects. Our approach to active perception maximizes the mutual information of the past observations (RGBD images and semantic segmentation masks) with respect to future observations using a generative model to select highly informative trajectories that explore large parts of the scene quickly. Bottom: We build a neural-radiance field (NeRF) representation of the scene to calculate this mutual information. This provides us with an accurate representation of the free space within which we can sample dynamically-feasible trajectories for a differentially-flat model of a quadrotor. This picture shows a mesh constructed from the voxel-grid representation implicit inside the NeRF after active exploration; color denotes objects of different categories predicted by our semantic NeRF.
  • Figure 2: A schematic of the neural architecture used in our approach: in addition to the standard NeRF model that predicts color ($c$) and density ($\sigma$), we also have an output that predicts categories of the object at location ($t$).
  • Figure 3: Evaluation of the uncertainty quantification metric. We train with 39 observations from a Habitat scene (by rotating in place) and test using 18 new viewpoints of the scene. (a) From left to right columns: ground truth observation, NeRF prediction, squared residual (zero-one loss for categories), and the estimate of uncertainty, for RGB, depth and semantic segmentation (top to bottom row). (b) Coverage of the error bars obtained from uncertainty prediction, i.e., fraction of true RGB values (top left) and depth (top right) that lie within the predicted uncertainty interval, as a function of training steps. Occupancy uncertainty (bottom left) of test observations. Proportion of pixels where the object category was incorrect and uncertain, i.e., with entropy more than 0.1 (bottom right). Shaded regions denote one standard deviation.
  • Figure 4: Top: Predictive information ($\text{I}_{\text{pred}}$) of the executed trajectories for the three different scenes. Middle: $\text{I}_{\text{pred}}$ integrated along trajectories at the beginning, middle, and end of active perception for scene 3 (top right). Bottom: Correlation among the four types of $\text{I}_{\text{pred}}$ at beginning, middle and end for scene 3 (top right).
  • Figure 5: Exploring doorways while performing active mapping: Cross section of the Instant-NGP voxel grid at 1.5 m height after exploration is shown on the left. The quadrotor starts at the red star. Blue pixels with value -1 are obstacles and the rest are free space. The color of a pixel is proportional to the number of times the field of view of the camera observes this pixel. This observation frequency is used in the frequency baseline. The black boundary in the left image denotes the learned map at initialization (without the quadrotor moving, using a few viewpoints). Right after the first trajectory is executed (seen in the right image), the quadrotor discovers a lot of free space in the scene; this is the region enclosed in the white boundary.
  • ...and 2 more figures
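
The two diagnostics described in the Figure 3 caption, coverage of the error bars and the proportion of incorrect-and-uncertain pixels, can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code. The function names are hypothetical, the interval is assumed to be the predicted mean plus or minus `width` standard deviations, and the entropy is assumed to be measured in nats.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats, an assumption) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def coverage(true_values, means, stds, width=1.0):
    """Fraction of true values inside mean +/- width * std.

    Assumption: the caption does not define the interval, so a symmetric
    interval around the predicted mean is used here.
    """
    inside = sum(
        1 for y, m, s in zip(true_values, means, stds) if abs(y - m) <= width * s
    )
    return inside / len(true_values)


def incorrect_and_uncertain_fraction(true_labels, pixel_probs, threshold=0.1):
    """Fraction of pixels whose predicted category is wrong and whose
    predictive entropy exceeds the threshold (0.1 in the caption)."""
    count = 0
    for label, probs in zip(true_labels, pixel_probs):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label and categorical_entropy(probs) > threshold:
            count += 1
    return count / len(true_labels)


if __name__ == "__main__":
    # Toy depth values (meters) with predicted means and standard deviations.
    print(coverage([1.0, 2.0, 5.0], [1.1, 2.0, 3.0], [0.2, 0.1, 0.5]))
    # Two pixels with identical predictions; only the second label disagrees.
    print(incorrect_and_uncertain_fraction([0, 1], [[0.6, 0.4], [0.6, 0.4]]))
```

High coverage means the predicted uncertainty interval contains most of the true values, while the second number counts the pixels where the model is wrong and its entropy exceeds the threshold.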

Theorems & Definitions (1)

  • Example 1: Linear system