Volumetric Semantically Consistent 3D Panoptic Mapping

Yang Miao, Iro Armeni, Marc Pollefeys, Daniel Barath

TL;DR

The paper tackles online 3D semantic-instance mapping for unstructured environments by extending a voxel-TSDF framework with (i) semantic prediction confidence integration, (ii) semantically consistent super-point construction, and (iii) graph-based semantic labeling plus instance refinement. The method fuses per-frame 2D panoptic-geometric predictions into a global map, optimizes semantic labels over a super-point graph, and refines instance assignments to reduce under- and over-segmentation, achieving state-of-the-art results on large public datasets. Notably, it demonstrates robustness under SLAM-estimated trajectories and reveals that evaluations based on ground-truth (GT) poses significantly overstate real-world performance, underscoring the need for SLAM-based evaluation in future work. The practical impact lies in producing accurate, real-time, scalable 3D semantic maps for autonomous agents in complex scenes, with potential improvements from better 2D panoptic inputs and pose estimation pipelines.

Abstract

We introduce an online 2D-to-3D semantic instance mapping algorithm aimed at generating comprehensive, accurate, and efficient semantic 3D maps suitable for autonomous agents in unstructured environments. The proposed approach is based on a Voxel-TSDF representation used in recent algorithms. It introduces novel ways of integrating semantic prediction confidence during mapping, producing semantic and instance-consistent 3D regions. Further improvements are achieved by graph optimization-based semantic labeling and instance refinement. The proposed method achieves accuracy superior to the state of the art on public large-scale datasets, improving on a number of widely used metrics. We also highlight a downfall in the evaluation of recent studies: using the ground truth trajectory as input instead of a SLAM-estimated one substantially affects the accuracy, creating a large gap between the reported results and the actual performance on real-world data.
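
To make the idea of integrating semantic prediction confidence more concrete, here is a minimal sketch, not the authors' formulation. It assumes each 2D observation of a super-point arrives as a hypothetical `(superpoint_id, label, confidence)` tuple and simply sums confidences per label. The function name and data layout are illustrative assumptions.

```python
from collections import defaultdict


def fuse_semantic_observations(observations):
    """Pick one label per super-point by confidence-weighted voting.

    `observations` is an iterable of (superpoint_id, label, confidence)
    tuples. This layout is an assumption made for illustration only.
    """
    # votes[superpoint_id][label] -> accumulated confidence
    votes = defaultdict(lambda: defaultdict(float))
    for sp_id, label, confidence in observations:
        votes[sp_id][label] += confidence
    # For each super-point, keep the label with the largest accumulated weight.
    return {sp_id: max(weights, key=weights.get) for sp_id, weights in votes.items()}


observations = [
    (0, "chair", 0.9),  # one confident prediction
    (0, "table", 0.4),  # two weak, conflicting predictions
    (0, "table", 0.4),
    (1, "wall", 0.7),
]
labels = fuse_semantic_observations(observations)
```

In this toy input, plain majority voting would label super-point 0 as "table" (two votes against one), whereas weighting by confidence favors the single confident "chair" prediction.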

Paper Structure

This paper contains 13 sections, 9 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: 3D panoptic segmentation comparison: the proposed method versus Voxblox++, Han et al., and INS-CONV. In this example, all competitors under-segment the bookshelf (top rectangle) and over-segment the table (bottom). The proposed technique sets a new benchmark in 2D-to-3D semantic-instance segmentation by ensuring semantic consistency across super-points (sets of adjacent 3D voxels) through graph-based semantic optimization and instance refinement.
  • Figure 2: In the 2D segmentation stage, the proposed method takes an RGB-D sequence and runs panoptic and geometric segmentation, whose outputs are then fused to extract 3D surfaces with panoptic labels. In the super-point construction stage, it incrementally lifts frame-wise surfaces into a global coordinate system to create semantically consistent super-points. In the segment graph stage, a super-point graph is constructed and updated, and semantic segmentation is performed using the graph. In the instance refinement stage, the initial instance segmentation is refined by graph optimization.
  • Figure 3: 3D segments colored by super-point labels, depicting objects (e.g., chair, wall) composed of single or multiple super-points. Each super-point is unique to an object.
  • Figure 4: Super-point segmentation without (left; as in Voxblox++) and with (right) semantic consistency checks as proposed in the super-point construction section. Without semantic consistency, noisy geometric segmentations result in merging the super-points of "wall" and "bed". By distinguishing semantic labels as proposed, such incorrect mergers happen less often.
  • Figure 5: Under-segmentation of the chair and the printer in the corner using Voxblox++, due to inaccuracies in panoptic segmentation. Instances are shown by color.
  • ...and 1 more figure
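
The semantic consistency check described in the Figure 4 caption can be sketched as follows. This is a hedged illustration rather than the paper's actual criterion: it assumes each super-point already carries a single semantic label, that geometric segmentation proposes candidate pairs to merge, and that a merge is rejected whenever the two labels differ. The function name `merge_superpoints` and its inputs are hypothetical.

```python
def merge_superpoints(labels, candidate_pairs):
    """Merge super-points proposed by geometry only if their labels agree.

    labels: list where labels[i] is the semantic label of super-point i.
    candidate_pairs: (a, b) index pairs that geometric overlap suggests merging.
    Returns a list mapping each super-point to the id of its merged group.
    All names here are illustrative assumptions, not the authors' API.
    """
    parent = list(range(len(labels)))  # union-find forest, one root per group

    def find(i):
        # Follow parent links to the group root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in candidate_pairs:
        root_a, root_b = find(a), find(b)
        # Merges only ever join equal labels, so a root's label is valid
        # for every member of its group.
        if root_a != root_b and labels[root_a] == labels[root_b]:
            parent[root_b] = root_a
    return [find(i) for i in range(len(labels))]


labels = ["wall", "bed", "bed", "wall"]
# Noisy geometry proposes merging wall (0) with bed (1), among other pairs.
candidate_pairs = [(0, 1), (1, 2), (0, 3)]
groups = merge_superpoints(labels, candidate_pairs)
```

Here the "wall"/"bed" proposal is rejected, while the two "bed" super-points and the two "wall" super-points are merged, mirroring the right-hand side of Figure 4.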