Voxel-Aggregated Feature Synthesis: Efficient Dense Mapping for Simulated 3D Reasoning

Owen Burns, Rizwan Qureshi

TL;DR

VAFS reduces the heavy computational cost of open-set dense 3D mapping by using the segmented point cloud that a simulator already provides and synthesizing views of each object, followed by voxel aggregation to keep point density uniform. The number of features to embed drops from the number of captured frames to the number of objects in the scene, and the method runs faster and achieves higher semantic IoU on RoCoBench queries than ConceptFusion and LeRF. This makes dense 3D mapping practical for simulation-based embodied research in which the map must be updated as the scene changes. The key contributions are synthetic view generation per region, voxel pooling for density control, and a ground-truth semantic mapping pipeline.

Abstract

We address the exploding computational requirements of recent state-of-the-art (SOTA) open-set multimodal 3D mapping (dense 3D mapping) algorithms and present Voxel-Aggregated Feature Synthesis (VAFS), a novel approach to dense 3D mapping in simulation. Dense 3D mapping involves segmenting and embedding sequential RGBD frames, which are then fused into 3D. This leads to redundant computation, because the differences between consecutive frames are small yet every frame is segmented and embedded individually. As a result, dense 3D mapping is impractical for research involving embodied agents, in which the environment, and thus the mapping, must be modified regularly. VAFS drastically reduces this computation by using the segmented point cloud computed by a simulator's physics engine and synthesizing views of each region. This reduces the number of features to embed from the number of captured RGBD frames to the number of objects in the scene, effectively allowing a "ground truth" semantic map to be computed an order of magnitude faster than traditional methods. We test the resulting representation by assessing the IoU scores of semantic queries for different objects in the simulated scene, and find that VAFS exceeds the accuracy and speed of prior dense 3D mapping techniques.
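
The IoU evaluation mentioned above can be illustrated with a minimal sketch. It assumes that a semantic query yields one relevancy score per point and that a point counts as predicted when its score reaches a fixed threshold; the function name `query_iou`, the `threshold` value, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def query_iou(relevancy, gt_mask, threshold=0.5):
    """IoU between thresholded per-point relevancy and a ground-truth mask.

    relevancy: (N,) float scores for one text query (assumed input format).
    gt_mask:   (N,) 0/1 or boolean labels marking the queried object.
    threshold: assumed cutoff; not a value reported in the paper.
    """
    pred = np.asarray(relevancy) >= threshold
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Neither prediction nor ground truth contains any point.
        return 0.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)

# Toy example: 4 points, 3 of which belong to the queried object.
scores = np.array([0.9, 0.8, 0.2, 0.6])
labels = np.array([1, 1, 1, 0])
iou = query_iou(scores, labels)
```

In this toy case two points are both predicted and labeled, and four points are predicted or labeled, so the score is 2 / 4.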

Paper Structure

This paper contains 10 sections, 6 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: The high-level workflow of VAFS. At each time step, we associate points $P$ with segments $C$ and render views of the regions of interest. We then align embeddings of those views with the point cloud and run voxel aggregation to ensure the distribution of points remains uniform. Subsequent time steps represent updates to the point cloud, and the process runs again with new views generated for segments of the point cloud that have changed.
  • Figure 2: Relevancy maps for semantic queries.
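
The voxel aggregation step named in the Figure 1 caption can be sketched as follows. This is a minimal illustration that assumes mean pooling of both positions and features within each voxel, which this summary does not specify; the function name `voxel_aggregate` and the `voxel_size` parameter are hypothetical.

```python
import numpy as np

def voxel_aggregate(points, features, voxel_size):
    """Pool all points that share a voxel into one averaged point.

    points:   (N, 3) float array of 3D positions.
    features: (N, D) float array of per-point embeddings.
    Mean pooling is an assumption made for this sketch.
    """
    # Integer voxel coordinates for every point.
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Assign each point the id of the unique voxel it falls into.
    _, inverse, counts = np.unique(
        voxel_idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against NumPy version differences

    n_voxels = counts.shape[0]
    pooled_points = np.zeros((n_voxels, points.shape[1]))
    pooled_feats = np.zeros((n_voxels, features.shape[1]))
    # Sum positions and features per voxel, then divide by the point count.
    np.add.at(pooled_points, inverse, points)
    np.add.at(pooled_feats, inverse, features)
    pooled_points /= counts[:, None]
    pooled_feats /= counts[:, None]
    return pooled_points, pooled_feats

# Toy example: the first two points share a voxel, the third is alone.
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
pooled_pts, pooled_feats = voxel_aggregate(pts, feats, voxel_size=1.0)
```

Because every occupied voxel contributes exactly one output point, regions that are sampled repeatedly across updates do not grow denser than the rest of the map.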