MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps

Jianhao Zheng, Daniel Barath, Marc Pollefeys, Iro Armeni

TL;DR

MAP-ADAPT tackles the inefficiency of uniform-detail 3D semantic maps by introducing a real-time, single-map framework that assigns regional quality levels based on both semantic categories and geometric complexity. It extends voxel-hashing TSDF maps with per-voxel semantic state, an adaptive parent-child hierarchy, and geometry-aware refinement, enabling fine detail where needed while conserving compute and storage. Using a semantic SLAM backbone, Bayesian fusion for voxel semantics, adaptive raycasting, and multi-resolution mesh extraction, the approach delivers competitive geometric and semantic accuracy with substantial memory savings compared to fixed-resolution maps, and it avoids issues seen in multi-map baselines. The results on synthetic and real datasets demonstrate practical viability for autonomous agents operating under tight computational budgets, with MAP-ADAPT-SG particularly excelling in geometry-rich regions.
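
To make "Bayesian fusion for voxel semantics" concrete, here is a minimal sketch of a recursive Bayesian update of one voxel's class distribution. It is an illustration under assumptions, not the paper's formulation: the function name fuse_semantics, the eps floor, and the three-class example values are all hypothetical, and the exact update used by MAP-ADAPT is not given in this summary.

```python
from typing import List, Sequence


def fuse_semantics(
    prior: Sequence[float],
    likelihood: Sequence[float],
    eps: float = 1e-6,
) -> List[float]:
    """One recursive Bayesian update of a voxel's class distribution.

    prior:      current per-class probabilities stored in the voxel.
    likelihood: per-class scores from the segmentation of the new frame.
    eps:        hypothetical floor so a single frame cannot zero out a class.
    """
    if len(prior) != len(likelihood):
        raise ValueError("prior and likelihood must have the same length")
    # Posterior is proportional to prior times likelihood, then renormalized.
    unnormalized = [p * max(l, eps) for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative values only: three classes, three observations of one voxel.
num_classes = 3
voxel = [1.0 / num_classes] * num_classes  # uniform prior
observations = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],  # one frame disagrees with the other two
]
for obs in observations:
    voxel = fuse_semantics(voxel, obs)

# Index of the most probable class after fusing all frames.
label = max(range(num_classes), key=lambda c: voxel[c])
```

In this toy run the single disagreeing frame does not overturn the two earlier ones, which is the behavior that makes multiplicative fusion more stable than keeping only the latest per-frame label.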

Abstract

Creating 3D semantic reconstructions of environments is fundamental to many applications, especially those related to autonomous agent operation (e.g., goal-oriented navigation or object interaction and manipulation). Commonly, 3D semantic reconstruction systems capture the entire scene at the same level of detail. However, certain tasks (e.g., object interaction) require a fine-grained, high-resolution map, particularly if the objects to interact with are small or have intricate geometry. In recent practice, this leads to the entire map being reconstructed at the same high resolution, which increases computational and storage costs. To address this challenge, we propose MAP-ADAPT, a real-time method for quality-adaptive semantic 3D reconstruction using RGBD frames. MAP-ADAPT is the first adaptive semantic 3D mapping algorithm that, unlike prior work, directly generates a single map with regions of different quality based on both the semantic information and the geometric complexity of the scene. Leveraging a semantic SLAM pipeline for pose and semantic estimation, we achieve comparable or superior results to state-of-the-art methods on synthetic and real-world data, while significantly reducing storage and computation requirements.

Paper Structure

This paper contains 10 sections, 3 equations, 9 figures, 14 tables, and 1 algorithm.

Figures (9)

  • Figure 1: MAP-ADAPT. Our method generates quality-adaptive semantic 3D maps of environments, where regions of different semantics and geometric complexity are reconstructed at different quality levels. An example map is shown here: 3D reconstructed mesh (top-left) and the semantic quality mask (bottom-left). Mask colors denote three quality levels, where red is high, green is middle, and blue is coarse. A plant reconstructed in high quality due to its semantic label is highlighted (top-right). Although the coffee machine should appear coarse based on its label, it is still mapped in fine resolution due to its high geometric complexity (bottom-right).
  • Figure 2: Overview of MAP-ADAPT. (a) Given RGBD frames, we estimate (b-i) semantic segmentation and (b-iv) camera pose and compute (b-ii) geometric complexity. (c-i) We integrate geometric and semantic information (b-iii) on the TSDF voxel map. The geometric complexity and the semantic label will define the voxel size of that region of the map. (c-ii) shows the multi-resolution mesh output. The adaptive structure we use is shown in (c-iii).
  • Figure 3: Illustration of forming a cube to generate a mesh from our multi-resolution map. (a) When a neighboring voxel of the queried resolution (orange node) does not exist, the corresponding coarser-resolution one (green node) will be used instead. (b) A ghost mesh is generated at the boundary of resolutions.
  • Figure 4: Reconstruction results per method. The top example is from the HSSD dataset and the bottom one from ScanNet. Geometric and completion errors are shown as heatmaps; the darker the color, the closer to the GT geometry. For the semantic map, results are colorized per quality level; different semantics in the same quality level range from brighter to darker. Another heatmap is used to show the estimated geometric complexity. We highlight regions that are classified into high-quality semantics (red block) or have large geometric variance (orange block). Best viewed on screen.
  • Figure A: Boundaries. There are three distinct types of boundaries among the coarsest voxels. (i) To create meshes from subvoxels on faces, we need adjacent subvoxels in 1 neighboring coarse voxel. (ii) For subvoxels positioned along the edges, subvoxels from 3 neighbors are required. (iii) Subvoxels from all 7 neighbors are queried to form the mesh for the subvoxel located at the corner.
  • ...and 4 more figures
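
The captions of Figures 1 and 2 state that the semantic label and the geometric complexity together define the voxel size of a region, and that high complexity can promote a region whose label alone would keep it coarse. The sketch below illustrates one way such a rule could look. It is a hedged illustration, not the paper's implementation: the label table, the voxel sizes, the threshold, and the "take the finer of the two levels" rule are all assumptions made for this example.

```python
from typing import Dict

# Three quality levels, matching the mask colors described in Figure 1.
COARSE, MIDDLE, HIGH = 0, 1, 2

# Hypothetical label-to-level table; the real assignment is task dependent.
SEMANTIC_LEVEL: Dict[str, int] = {
    "wall": COARSE,
    "coffee machine": COARSE,
    "chair": MIDDLE,
    "plant": HIGH,
}

# Hypothetical voxel edge lengths in meters for each level.
VOXEL_SIZE_M: Dict[int, float] = {COARSE: 0.08, MIDDLE: 0.04, HIGH: 0.02}

# Hypothetical threshold on a normalized geometric complexity score in [0, 1].
COMPLEXITY_THRESHOLD = 0.5


def quality_level(label: str, complexity: float) -> int:
    """Return the finer of the semantic level and the geometric level."""
    semantic = SEMANTIC_LEVEL.get(label, COARSE)  # unknown labels stay coarse
    geometric = HIGH if complexity > COMPLEXITY_THRESHOLD else COARSE
    return max(semantic, geometric)


def voxel_size(label: str, complexity: float) -> float:
    """Voxel edge length in meters for a region with this label and complexity."""
    return VOXEL_SIZE_M[quality_level(label, complexity)]


# The two highlighted cases from Figure 1, with made-up complexity scores.
plant_level = quality_level("plant", 0.1)            # fine because of its label
machine_level = quality_level("coffee machine", 0.9)  # fine because of geometry
```

Taking the maximum of the two levels means neither cue can lower the resolution chosen by the other, which matches the coffee machine example in Figure 1; whether the actual system combines the cues this way is an assumption here.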