SOccDPT: Semi-Supervised 3D Semantic Occupancy from Dense Prediction Transformers trained under memory constraints

Aditya Nalgunda Ganesh

TL;DR

SOccDPT addresses 3D semantic occupancy prediction from monocular images under memory constraints in unstructured traffic by combining a Dense Prediction Transformer backbone with two heads, one for disparity and one for semantic segmentation. It uses semi-supervised learning with pseudo-ground truth, applying depth boosting and semantic auto-labelling to augment the Indian Driving Dataset (IDD) and the Bengaluru Driving Dataset, and employs a patch-wise training scheme to fit limited hardware. The approach runs at a competitive real-time rate (≈69.5 Hz) and reports a disparity RMSE of ≈9.15 and a semantic segmentation IoU of ≈46.0% on challenging unstructured traffic, and it produces the Bengaluru Semantic Occupancy Dataset. The work demonstrates practical memory-efficient 3D perception for autonomous navigation and provides public code and data to foster further research.

Abstract

We present SOccDPT, a memory-efficient approach for 3D semantic occupancy prediction from monocular image input using dense prediction transformers. To address the limitations of existing methods trained on structured traffic datasets, we train our model on unstructured datasets including the Indian Driving Dataset and the Bengaluru Driving Dataset. Our semi-supervised training pipeline allows SOccDPT to learn from datasets with limited labels: it replaces most manual labelling with pseudo-ground truth labels, which we also use to produce our Bengaluru Semantic Occupancy Dataset. This broader training improves our model's ability to handle unstructured traffic scenarios. To overcome memory limitations during training, we introduce patch-wise training, in which we select a subset of parameters to train each epoch, reducing memory usage during autograd graph construction. In the context of unstructured traffic and memory-constrained training and inference, SOccDPT outperforms existing disparity estimation approaches with an RMSE score of 9.1473, achieves a semantic segmentation IoU score of 46.02% and operates at a competitive frequency of 69.47 Hz. We make our code and semantic occupancy dataset public.
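
The patch-wise training idea in the abstract (train only a subset of parameters in each epoch) can be sketched roughly as follows. This is a minimal illustration and not the authors' algorithm: the abstract does not say how the subset is chosen, so the round-robin schedule, the helper names `make_patches` and `trainable_for_epoch`, and the parameter names are all assumptions made for the example.

```python
from typing import List, Set


def make_patches(param_names: List[str], num_patches: int) -> List[List[str]]:
    """Split parameter names into contiguous, near-equal groups ("patches")."""
    if num_patches < 1:
        raise ValueError("num_patches must be >= 1")
    size, extra = divmod(len(param_names), num_patches)
    patches, start = [], 0
    for i in range(num_patches):
        # The first `extra` patches take one additional parameter each.
        end = start + size + (1 if i < extra else 0)
        patches.append(param_names[start:end])
        start = end
    return patches


def trainable_for_epoch(patches: List[List[str]], epoch: int) -> Set[str]:
    """Assumed schedule: cycle through the patches, one patch per epoch."""
    return set(patches[epoch % len(patches)])


# Hypothetical parameter names, for illustration only.
param_names = [
    "backbone.block0",
    "backbone.block1",
    "backbone.block2",
    "disparity_head",
    "segmentation_head",
]
patches = make_patches(param_names, num_patches=3)

for epoch in range(3):
    active = trainable_for_epoch(patches, epoch)
    # In a PyTorch training loop one could freeze everything else here, e.g.
    #   for name, p in model.named_parameters():
    #       p.requires_grad_(name in active)
    # so that gradients are only tracked for the active patch.
    print(epoch, sorted(active))
```

Freezing the inactive parameters means no gradients need to be stored for them, which is one way to read the memory saving the abstract attributes to a smaller autograd graph.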

Paper Structure (12 sections, 1 equation, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: A few frames from our Bengaluru Semantic Occupancy Dataset, which is an extension of the Bengaluru Driving Dataset [OCTraN_analgund2023octran]. Each panel consists of the RGB image with 2D semantic labels on the top left, the disparity map on the bottom left and the 3D semantic occupancy on the right. The vehicle and pedestrian classes are colored in blue and red respectively. Objects without classes have been plotted as a height map for the sake of visualization. The ego vehicle and its future trajectory have been plotted in grey and green respectively to help the reader understand the scene.
  • Figure 2: SOccDPT uses the ViT family for backbone feature extraction, which allows us to carefully balance accuracy and compute requirements. SOccDPT takes an RGB image input of shape $3 \times 256 \times 256$ and produces image features of shape $256 \times 128 \times 128$. We then pass the extracted features to a disparity head and a segmentation head. We apply the Scale and Shift Invariant loss [Ranftl2022_midas_ssi_loss] and the Binary Cross Entropy loss to the disparity and segmentation outputs respectively. With the known camera intrinsics, we project the semantics into 3D space with the help of the disparity map and accumulate the semantics into a 3D occupancy grid of size $256 \times 256 \times 32$, thus producing a 3D semantic map from one backbone.
  • Figure 3: Qualitative results on frames from BDD comparing several MiDaS versions, monodepth, manydepth and ZeroDepth. None of these existing approaches handles the diversity seen in unstructured traffic.
  • Figure 4: We use Depth Boosting to generate depth labels for the Indian Driving Dataset. The RGB frames are on the left, the segmentation maps in the middle and our depth labels on the right. We would like to highlight the detail in the automatically generated disparity maps.
  • Figure 5: We use Semantic Segmentation auto-labelling to generate semantic labels for the Bengaluru Driving Dataset. The RGB frames are on the left, our segmentation maps in the middle and the depth labels on the right. We would like to highlight the accuracy of the automatically generated segmentation maps.
  • ...and 1 more figure
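
The projection step described in the Figure 2 caption (lifting per-pixel semantics into a 3D occupancy grid using the disparity map and camera intrinsics) might look roughly like the sketch below. It uses a standard pinhole back-projection and is not taken from the paper's code. The function name `semantic_occupancy`, the `voxel_size` and `disparity_to_depth` parameters, the axis convention, and the last-write-wins accumulation are all assumptions. In particular, a disparity trained with a scale and shift invariant loss would first need calibrating before `depth = disparity_to_depth / disparity` gives metric depth.

```python
import math
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def semantic_occupancy(
    disparity: List[List[float]],
    labels: List[List[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    grid_shape: Tuple[int, int, int] = (256, 256, 32),
    voxel_size: float = 0.5,          # assumed voxel edge length
    disparity_to_depth: float = 1.0,  # assumed calibration constant
) -> Dict[Voxel, int]:
    """Back-project labelled pixels and store them in a sparse voxel grid."""
    nx, ny, nz = grid_shape
    grid: Dict[Voxel, int] = {}
    for v, (disp_row, label_row) in enumerate(zip(disparity, labels)):
        for u, (d, label) in enumerate(zip(disp_row, label_row)):
            if d <= 0:
                continue  # no valid depth for this pixel
            z = disparity_to_depth / d  # forward distance from the camera
            x = (u - cx) * z / fx       # lateral offset (right is positive)
            y = (v - cy) * z / fy       # vertical offset (down is positive)
            # Assumed grid axes: i lateral (camera at centre), j forward, k up.
            i = math.floor(x / voxel_size) + nx // 2
            j = math.floor(z / voxel_size)
            k = math.floor(-y / voxel_size) + nz // 2
            if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
                grid[(i, j, k)] = label  # assumed rule: last write wins
    return grid


# Tiny 2x2 example with made-up values.
disparity = [[0.5, 0.5], [0.25, 0.0]]
labels = [[1, 1], [2, 2]]
grid = semantic_occupancy(
    disparity, labels, fx=2.0, fy=2.0, cx=0.5, cy=0.5, voxel_size=1.0
)
print(grid)
```

A sparse dictionary keeps the example short. A dense $256 \times 256 \times 32$ array with one entry per voxel would be the more direct match to the grid size given in the caption.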