Lift, Splat, Map: Lifting Foundation Masks for Label-Free Semantic Scene Completion

Arthur Zhang, Rainier Heijne, Joydeep Biswas

TL;DR

This work proposes LSMap, a method that lifts masks from visual foundation models to predict a continuous, open-set semantic and elevation-aware representation in bird's eye view (BEV) for the entire scene, including regions underneath dynamic entities and in occluded areas.

Abstract

Autonomous mobile robots deployed in urban environments must be context-aware, i.e., able to distinguish between different semantic entities, and robust to occlusions. Current approaches such as semantic scene completion (SSC) require pre-enumerating the set of classes and rely on costly human annotations, while representation learning methods relax these assumptions but are not robust to occlusions and learn representations tailored towards auxiliary tasks. To address these limitations, we propose LSMap, a method that lifts masks from visual foundation models to predict a continuous, open-set semantic and elevation-aware representation in bird's eye view (BEV) for the entire scene, including regions underneath dynamic entities and in occluded areas. Our model requires only a single RGB-D image, does not require human labels, and operates in real time. We quantitatively demonstrate that, with finetuning, our approach outperforms existing models trained from scratch on semantic and elevation scene completion tasks. Furthermore, we show that our pre-trained representation outperforms existing visual foundation models at unsupervised semantic scene completion. We evaluate our approach using CODa, a large-scale, real-world urban robot dataset. Supplementary visualizations, code, data, and pre-trained models will be publicly available soon.
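To make the "lift" and "splat" steps concrete, the following is a minimal sketch, not the authors' implementation. It back-projects each pixel of an RGB-D image with a pinhole model and averages per-pixel feature vectors into BEV grid cells. The function name `splat_to_bev`, the camera convention (x right, z forward), the grid ranges, and the cell size are all assumptions made for illustration.

```python
from collections import defaultdict


def splat_to_bev(depth, features, fx, cx, cell_size=0.5,
                 x_range=(-5.0, 5.0), z_range=(0.0, 10.0)):
    """Average per-pixel feature vectors into BEV cells (illustrative only).

    depth:    H x W nested list of metric depths (0 or None means no return).
    features: H x W nested list of feature vectors (lists of floats).
    fx, cx:   horizontal focal length and principal point, in pixels.
    Returns a dict mapping (row, col) BEV cells to the mean feature vector.
    Assumed camera frame: x to the right, z forward; the BEV plane is x-z.
    """
    sums = {}
    counts = defaultdict(int)
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if not z or z <= 0:
                continue  # skip pixels without a valid depth measurement
            x = (u - cx) * z / fx  # lift: pinhole back-projection
            in_x = x_range[0] <= x < x_range[1]
            in_z = z_range[0] <= z < z_range[1]
            if not (in_x and in_z):
                continue  # point falls outside the BEV grid
            cell = (int((z - z_range[0]) // cell_size),
                    int((x - x_range[0]) // cell_size))
            feat = features[v][u]
            if cell not in sums:
                sums[cell] = [0.0] * len(feat)
            for i, value in enumerate(feat):
                sums[cell][i] += value  # splat: accumulate into the cell
            counts[cell] += 1
    return {cell: [s / counts[cell] for s in acc]
            for cell, acc in sums.items()}
```

Cells that receive no points (for example, regions behind an occluder) are simply absent from the returned dictionary. In a pipeline like the one described above, those are the regions a completion network would have to fill in.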

Paper Structure (24 sections, 9 equations, 6 figures, 2 tables, 1 algorithm)


Figures (6)

  • Figure 1: LSMap Architecture and Training Pipeline. a) We lift instance masks from Segment Anything (Kirillov et al., 2023) to bird's eye view (BEV) space. We greedily merge sequential masks based on mask intersection area and use a contrastive loss to learn continuous semantic representations from these BEV masks. b) We pretrain a feature and depth completion backbone using DINOv2 (Oquab et al., 2023) features and ground truth depth labels. c) LSMap predicts and splats semantic features to a BEV feature map and uses a multi-head inpainting network to predict a continuous representation composed of semantic and elevation features. For more details, please see the technical formulation section.
  • Figure 2: LSMap Predictions. Our model takes RGB and LiDAR depth measurements in the form of RGB-D images (first row), and predicts continuous semantic features and elevation for the entire field of view. We perform PCA dimensionality reduction on the continuous semantic features for visualization. Patches shaded in white correspond to unoccluded regions.
  • Figure 3: Semantic Class Ontology. For conciseness, we abbreviate the following semantic classes in the unsupervised semantic scene completion results table. Each patch on the left denotes the color label assigned to each class. We follow this color map for the visualizations in this work.
  • Figure 4: Comparison of Different Ground Truth Depths. We present depth images from three scenes constructed using various depth estimation strategies. We overlay the colorized depth map on the RGB image, where the color indicates the distance from the camera. The left column projects a single LiDAR scan onto the image. The middle column accumulates the past 50 LiDAR scans and projects them onto the image. The rightmost column uses our proposed depth estimation strategy, Stereo Depth Filtered LiDAR + Inverse Distance Weighting (SD+IDW).
  • Figure 5: Visualization of the Predicted BEV Representations From LSMap. We visualize the continuous representation predicted by LSMap for various scenes in CODa. We perform PCA dimensionality reduction to visualize the predicted semantic embeddings for each frame in RGB. For each scene, the left column shows the RGB and LiDAR depth input to our model and the predicted semantic and elevation embeddings. The right column shows the ground truth semantic class (top) and elevation (bottom). Notably, our model learns this representation without human labels. Patches shaded in white correspond to unoccluded regions.
  • ...and 1 more figure
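
The Figure 1 caption mentions greedily merging sequential masks based on mask intersection area. The sketch below illustrates one way such a merge could look, under stated assumptions: each mask is a set of BEV cell indices, masks arrive frame by frame, and a mask is merged into the existing mask it overlaps most, provided the overlap reaches `min_overlap` cells. The function name, the set representation, and the threshold are hypothetical; the exact criterion used by LSMap is not specified in this text.

```python
def greedy_merge_masks(frames, min_overlap=1):
    """Greedily merge per-frame BEV masks by intersection area (illustrative).

    frames: list of frames, each a list of masks; a mask is an iterable of
            (row, col) BEV cells.
    min_overlap: minimum number of shared cells required to merge
                 (hypothetical threshold).
    Returns a list of merged masks as sets of cells.
    """
    merged = []
    for masks in frames:
        for mask in masks:
            mask = set(mask)
            best_idx, best_area = None, 0
            # Find the existing merged mask with the largest intersection.
            for idx, existing in enumerate(merged):
                area = len(mask & existing)
                if area > best_area:
                    best_idx, best_area = idx, area
            if best_idx is not None and best_area >= min_overlap:
                merged[best_idx] |= mask  # absorb into the best match
            else:
                merged.append(mask)  # no sufficient overlap: new instance
    return merged
```

Because the procedure is greedy, the result depends on the order in which masks are processed, and overlapping masks from the same frame can also be merged with each other. A contrastive objective could then treat cells within one merged mask as positives and cells from different masks as negatives.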