SceneSense: Diffusion Models for 3D Occupancy Synthesis from Partial Observation

Alec Reed, Brendan Crowe, Doncey Albin, Lorin Achey, Bradley Hayes, Christoffer Heckman

TL;DR

SceneSense tackles the problem of planning in partially observed environments by generating plausible local 3D occupancy around a robotic platform in real time. It uses a diffusion-based framework conditioned on RGB-D cues and an occupancy map, paired with an occupancy inpainting mechanism that ensures observed space is never overwritten. The method demonstrates quantitative gains over a running Octomap baseline on HM3D home environments, supported by FID/KID metrics and qualitative visuals, and includes extensive ablations on diffusion steps, guidance scale, and conditioning. The work enables runtime estimation of unobserved local geometry without offline scene synthesis, with potential to accelerate exploration and planning in unknown indoor environments.

Abstract

When exploring new areas, robotic systems generally plan and execute controls exclusively over geometry that has been directly measured. When entering space that was previously obstructed from view, such as when turning corners in hallways or entering new rooms, robots often pause to plan over the newly observed space. To address this we present SceneSense, a real-time 3D diffusion model for synthesizing 3D occupancy information from partial observations that effectively predicts these occluded or out-of-view geometries for use in future planning and control frameworks. SceneSense uses a running occupancy map and a single RGB-D camera to generate predicted geometry around the platform at runtime, even when the geometry is occluded or out of view. Our architecture ensures that SceneSense never overwrites observed free or occupied space. By preserving the integrity of the observed map, SceneSense mitigates the risk of corrupting the observed space with generative predictions. While SceneSense is shown to operate well using a single RGB-D camera, the framework is flexible enough to extend to additional modalities. SceneSense operates "out of the box" as part of any system that generates a running occupancy map by removing conditioning from the framework. Alternatively, for maximum performance in new modalities, the perception backbone can be replaced and the model retrained for inference in new applications. Unlike existing models that necessitate multiple views and offline scene synthesis, or are focused on filling gaps in observed data, our findings demonstrate that SceneSense is an effective approach to estimating unobserved local occupancy information at runtime. Local occupancy predictions from SceneSense are shown to better represent the ground truth occupancy distribution during the test exploration trajectories than the running occupancy map.
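
The guarantee that observed space is never overwritten can be illustrated with a small inpainting step. The sketch below is not the authors' implementation: it assumes a standard DDPM-style forward process with a cumulative noise coefficient `alpha_bar_t`, a voxel encoding of +1 for occupied and -1 for free, and hypothetical function names (`noise_to_step`, `inpaint_observed`). At each reverse step, the observed voxels are noised to the current step and written over the model's noisy prediction, so the denoiser only ever changes unobserved voxels.

```python
import numpy as np


def noise_to_step(x0, alpha_bar_t, rng):
    """Forward-diffuse clean data x0 to a diffusion step with cumulative
    coefficient alpha_bar_t (assumed DDPM-style schedule)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def inpaint_observed(x_t, observed_values, observed_mask, alpha_bar_t, rng):
    """Overwrite observed voxels in the noisy prediction x_t.

    x_t             : noisy occupancy prediction at the current step
    observed_values : running map values (assumed +1 occupied, -1 free)
    observed_mask   : True where the voxel has been directly observed
    """
    noised_obs = noise_to_step(observed_values, alpha_bar_t, rng)
    # Observed voxels come from the (noised) map; the rest stay as predicted.
    return np.where(observed_mask, noised_obs, x_t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_t = rng.standard_normal((8, 8, 8))          # hypothetical local grid
    obs = np.full((8, 8, 8), -1.0)                # observed free space
    obs[:, :, 0] = 1.0                            # an observed wall
    mask = np.zeros((8, 8, 8), dtype=bool)
    mask[:, :, :3] = True                         # only part of the grid is observed
    x_t = inpaint_observed(x_t, obs, mask, alpha_bar_t=0.5, rng=rng)
    print(x_t.shape)
```

In a full reverse process this step would be applied before every denoising call, with `alpha_bar_t` approaching 1 as the step index approaches 0, so the final output matches the running map exactly wherever the map has observations.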

Paper Structure (18 sections, 6 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Test house 2, where the robot exploration trajectory is shown via the black points and the starting point is shown in green. Two SceneSense generations are shown. From left to right: (1) the inputs, where green voxels are the local occupancy information, shown alongside the current camera view from the robot; (2) the SceneSense occupancy prediction, where the running occupancy information is shown in green and newly predicted occupancy in red; (3) the running occupancy information, again shown in green, with the ground truth full local occupancy data shown in yellow.
  • Figure 2: Reverse Diffusion Process: The reverse diffusion process takes the local occupancy information, the current sensor measurements (an RGB-D image in this case), and the Gaussian noise of the area to be diffused over. Noise commensurate with the current diffusion step is added to the local occupancy information, which includes occupied (green) and observed unoccupied (red) data. The result is inpainted into the noisy local occupancy prediction, as discussed in the method section. The inpainted noise data and the feature vectors generated by the perception backbone are provided to the denoising network, which generates a new noisy geometry prediction at $t-1$. This process is repeated as the starting noise $x_T$ is iteratively denoised to $x_0$, which is the final geometry prediction from the framework.
  • Figure 3: Various SceneSense predictions from equivalent input data, where green is the running occupancy map and red is the SceneSense predicted occupancy. Given the limited input information, the diffusion framework can generate multiple reasonable predictions from the same input conditioning.
  • Figure 4: Test house 1, where the robot exploration trajectory is shown via the black points and the starting point is shown in green. Two SceneSense generations are shown. From left to right: (1) the inputs, where green voxels are the local occupancy information, shown alongside the current camera view from the robot; (2) the SceneSense occupancy prediction, where the running occupancy information is shown in green and newly predicted occupancy in red; (3) the running occupancy information, again shown in green, with the ground truth full local occupancy data shown in yellow.
  • Figure 5: Ratio of predicted voxels $v_p$ to occupied voxels $v_o$ ($\frac{v_p}{v_o}$) over the house 2 exploration. (a) Superimposes the running occupancy map over the house mesh, where the colors of the occupancy map show how many steps have been run up to that point. Green voxels are the running occupancy map from step 0 to step 20, blue from step 0 to step 95, and red from step 0 to step 150. These colors correspond to the plot line colors in (b). (b) Shows $\frac{v_p}{v_o}$ as the robot explores the space. $\frac{v_p}{v_o}$ starts high at time step 0, when the occupancy map is sparse, and quickly drops over the green exploration as more of the local scene is observed. $\frac{v_p}{v_o}$ stays relatively low as the vehicle completes the exploration of the green room, navigates back to the start point, and traverses the hallway. $\frac{v_p}{v_o}$ increases slightly as the robot traverses previously unobserved space (red), which calls for more predicted voxels because less of the scene has been observed.
  • ...and 1 more figures
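
The ratio $\frac{v_p}{v_o}$ plotted in Figure 5 can be computed from two boolean voxel grids. The following is a minimal sketch under stated assumptions rather than the authors' evaluation code: the function name `predicted_to_occupied_ratio` is hypothetical, $v_p$ is assumed to count only predicted voxels that are not already occupied in the running map, and the handling of an empty map is an illustrative choice.

```python
import numpy as np


def predicted_to_occupied_ratio(predicted_mask, occupied_mask):
    """Return v_p / v_o for one local grid.

    predicted_mask : True where the model predicts occupancy
    occupied_mask  : True where the running map has observed occupancy
    """
    # Assumption: only newly predicted voxels count toward v_p.
    v_p = int(np.count_nonzero(predicted_mask & ~occupied_mask))
    v_o = int(np.count_nonzero(occupied_mask))
    if v_o == 0:
        # Illustrative choice for an empty map; not specified in the source.
        return float("inf") if v_p > 0 else 0.0
    return v_p / v_o


if __name__ == "__main__":
    occupied = np.zeros((2, 2, 2), dtype=bool)
    occupied[0] = True                 # 4 observed occupied voxels
    predicted = np.zeros((2, 2, 2), dtype=bool)
    predicted[1, 0] = True             # 2 newly predicted voxels
    predicted[0, 0, 0] = True          # overlaps the map, so not counted
    print(predicted_to_occupied_ratio(predicted, occupied))
```

Under these assumptions a sparse map with many predictions yields a high ratio, and the ratio falls as observation fills in the local grid, which matches the trend the caption describes.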