Semantic Segmentation and Scene Reconstruction of RGB-D Image Frames: An End-to-End Modular Pipeline for Robotic Applications

Zhiwu Zheng, Lauren Mentzer, Berk Iskender, Michael Price, Colm Prendergast, Audren Cloitre

TL;DR

The paper presents an end-to-end modular pipeline for semantic segmentation and scene reconstruction of RGB-D frames, tailored for robotics. It combines SAM2-based mask generation with SegFormer/OneFormer semantic labeling to produce sharp masks and accurate labels, while human-tracking and semantic-guided point-cloud fusion modules enable continuous tracking and efficient 3D reconstruction. Evaluation shows semantic accuracy comparable to state-of-the-art methods, improved mask quality, a 1.81x reduction in fusion computation time from semantic guidance, and a mean reconstruction error of about 25.3 mm on benchmark data. The resulting scene representation is stored in USD for easy querying and simulation. The results demonstrate practical viability for real-world robotic perception, navigation, and interaction tasks.

Abstract

Robots operating in unstructured environments require a comprehensive understanding of their surroundings, which in turn requires both geometric and semantic information from sensor data. Traditional RGB-D processing pipelines focus primarily on geometric reconstruction, limiting their ability to support advanced robotic perception, planning, and interaction. A key challenge is the lack of generalized methods for segmenting RGB-D data into semantically meaningful components while maintaining accurate geometric representations. We introduce a novel end-to-end modular pipeline that integrates state-of-the-art semantic segmentation, human tracking, point-cloud fusion, and scene reconstruction. Our approach improves semantic segmentation by leveraging the foundational segmentation model SAM2 in a hybrid method that combines its mask generation with a semantic classification model, resulting in sharper masks and high classification accuracy. Compared to SegFormer and OneFormer, our method achieves similar semantic segmentation accuracy (mIoU of 47.0% vs 45.9% on the ADE20K dataset) but provides much more precise object boundaries. Additionally, our human tracking algorithm interacts with the segmentation and uses object re-identification to enable continuous tracking even when objects leave and re-enter the frame. By leveraging the semantic information, our point cloud fusion approach reduces computation time by 1.81x while maintaining a small mean reconstruction error of 25.3 mm. We validate our approach on benchmark datasets and real-world Kinect RGB-D data, demonstrating improved efficiency, accuracy, and usability. Our structured representation, stored in the Universal Scene Description (USD) format, supports efficient querying, visualization, and robotic simulation, making it practical for real-world deployment.
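
The hybrid idea in the abstract (class-agnostic masks from one model, class labels from another) can be illustrated with a small majority-vote sketch. This is not the authors' implementation: the function name `vote_labels`, the largest-mask-first painting order, and the fallback to the raw semantic map for uncovered pixels are all assumptions made for illustration. The inputs stand in for SAM2-style boolean masks and a per-pixel label map from a model such as SegFormer or OneFormer.

```python
import numpy as np


def vote_labels(masks, semantic_map, num_classes):
    """Give each mask the most frequent semantic label among its pixels.

    masks:        list of boolean arrays of shape (H, W), e.g. from a mask generator
    semantic_map: integer array of shape (H, W) with per-pixel class ids
    Returns the fused label map and a dict {mask index: voted label}.
    """
    fused = semantic_map.copy()  # assumption: uncovered pixels keep their semantic label
    voted = {}
    # Assumption: paint large masks first so smaller (finer) masks overwrite them.
    order = sorted(range(len(masks)), key=lambda i: int(masks[i].sum()), reverse=True)
    for i in order:
        mask = masks[i]
        if not mask.any():
            continue  # skip empty masks
        # Count how often each class occurs inside the mask and take the winner.
        counts = np.bincount(semantic_map[mask], minlength=num_classes)
        label = int(counts.argmax())
        fused[mask] = label
        voted[i] = label
    return fused, voted
```

In this sketch the mask boundaries decide where a label applies and the semantic map only decides which label wins, which is one way to obtain both sharp edges and plausible classes; the paper's actual voting rule may differ.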

Paper Structure

This paper contains 11 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Semantic segmentation and scene reconstruction from color and depth images.
  • Figure 2: Our pipeline for segmenting and structuring RGB-D data.
  • Figure 3: Approach to semantic segmentation. Semantic and mask branches are combined by a voting process that yields both correct labels and fine mask edges.
  • Figure 4: The voting process combines the results of the semantic and mask branches.
  • Figure 5: Visualization of the point cloud merging algorithm inspired by SAM3D. Orange and green circles represent 3D points captured at $t=0$ and $t=1$, respectively. Downsampling of the final point cloud is not depicted.
  • ...and 7 more figures
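
The Figure 5 caption mentions merging points captured at two time steps and then downsampling the final point cloud. The sketch below shows only a generic version of that last step: concatenating two clouds and applying plain voxel-grid downsampling. It is not the paper's SAM3D-inspired, semantic-guided merge; the function name `merge_and_downsample`, the `voxel_size` parameter, and the choice of voxel centroids are assumptions for illustration.

```python
import numpy as np


def merge_and_downsample(points_a, points_b, voxel_size):
    """Concatenate two (N, 3) point clouds and keep one centroid per voxel."""
    merged = np.vstack([points_a, points_b]).astype(np.float64)
    # Integer voxel coordinates for every point.
    voxel_idx = np.floor(merged / voxel_size).astype(np.int64)
    # Map each point to the index of its (unique) voxel.
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    n_voxels = int(inverse.max()) + 1
    # Sum the points in each voxel, then divide by the per-voxel count.
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, merged)
    counts = np.bincount(inverse, minlength=n_voxels)
    return sums / counts[:, None]
```

Points from both frames that fall into the same voxel collapse to a single averaged point, which keeps the merged cloud from growing with every added frame.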