MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps
Jianhao Zheng, Daniel Barath, Marc Pollefeys, Iro Armeni
TL;DR
MAP-ADAPT tackles the inefficiency of uniform-detail 3D semantic maps by introducing a real-time, single-map framework that assigns regional quality levels based on both semantic categories and geometric complexity. It extends voxel-hashing TSDF maps with per-voxel semantic state, an adaptive parent-child hierarchy, and geometry-aware refinement, enabling fine detail where needed while conserving compute and storage. Using a semantic SLAM backbone, Bayesian fusion for voxel semantics, adaptive raycasting, and a multi-resolution mesh extraction, the approach delivers competitive geometric and semantic accuracy with substantial memory savings compared to fixed-resolution maps and avoids issues seen in multi-map baselines. The results on synthetic and real datasets demonstrate practical viability for autonomous agents operating under tight computational budgets, with MAP-ADAPT-SG particularly excelling in geometry-rich regions.
Abstract
Creating 3D semantic reconstructions of environments is fundamental to many applications, especially when related to autonomous agent operation (e.g., goal-oriented navigation or object interaction and manipulation). Commonly, 3D semantic reconstruction systems capture the entire scene in the same level of detail. However, certain tasks (e.g., object interaction) require a fine-grained and high-resolution map, particularly if the objects to interact are of small size or intricate geometry. In recent practice, this leads to the entire map being in the same high-quality resolution, which results in increased computational and storage costs. To address this challenge, we propose MAP-ADAPT, a real-time method for quality-adaptive semantic 3D reconstruction using RGBD frames. MAP-ADAPT is the first adaptive semantic 3D mapping algorithm that, unlike prior work, generates directly a single map with regions of different quality based on both the semantic information and the geometric complexity of the scene. Leveraging a semantic SLAM pipeline for pose and semantic estimation, we achieve comparable or superior results to state-of-the-art methods on synthetic and real-world data, while significantly reducing storage and computation requirements.
