Semantic Segmentation and Scene Reconstruction of RGB-D Image Frames: An End-to-End Modular Pipeline for Robotic Applications

Zhiwu Zheng; Lauren Mentzer; Berk Iskender; Michael Price; Colm Prendergast; Audren Cloitre

TL;DR

The paper presents an end-to-end modular pipeline for semantic segmentation and scene reconstruction of RGB-D frames, tailored for robotics. It combines SAM2-based mask generation with SegFormer/OneFormer semantic labeling to produce sharp masks and accurate labels, while a complementary human-tracking module and a semantic-guided point-cloud fusion module enable continuous tracking and efficient 3D reconstruction. Evaluation shows semantic accuracy comparable to state-of-the-art methods, improved mask quality, a 1.81x reduction in point-cloud fusion computation time from semantic guidance, and a mean reconstruction error of 25.3 mm. The resulting scene representation is stored in the Universal Scene Description (USD) format for easy querying and simulation. The results demonstrate practical viability for real-world robotic perception, navigation, and interaction tasks.

Abstract

Robots operating in unstructured environments require a comprehensive understanding of their surroundings, which in turn requires both geometric and semantic information from sensor data. Traditional RGB-D processing pipelines focus primarily on geometric reconstruction, limiting their ability to support advanced robotic perception, planning, and interaction. A key challenge is the lack of generalized methods for segmenting RGB-D data into semantically meaningful components while maintaining accurate geometric representations. We introduce a novel end-to-end modular pipeline that integrates state-of-the-art semantic segmentation, human tracking, point-cloud fusion, and scene reconstruction. Our approach improves semantic segmentation by leveraging the foundational segmentation model SAM2 in a hybrid method that combines its mask generation with a semantic classification model, resulting in sharper masks and high classification accuracy. Compared to SegFormer and OneFormer, our method achieves similar semantic segmentation accuracy (mIoU of 47.0% vs 45.9% on the ADE20K dataset) but provides much more precise object boundaries. Additionally, our human tracking algorithm interacts with the segmentation module and uses object re-identification to maintain continuous tracking even when objects leave and re-enter the frame. By leveraging the semantic information, our point-cloud fusion approach reduces computation time by a factor of 1.81 while maintaining a small mean reconstruction error of 25.3 mm. We validate our approach on benchmark datasets and real-world Kinect RGB-D data, demonstrating improved efficiency, accuracy, and usability. Our structured representation, stored in the Universal Scene Description (USD) format, supports efficient querying, visualization, and robotic simulation, making it practical for real-world deployment.
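
To make the hybrid idea in the abstract concrete, the sketch below shows one plausible way to attach class labels to class-agnostic masks: each mask takes the most frequent per-pixel class inside it. The abstract does not state the combination rule, so the majority vote, the function name `label_masks_by_majority`, and the toy inputs are all assumptions for illustration, not the authors' implementation. The masks stand in for SAM2 output and the class-id grid stands in for a SegFormer/OneFormer prediction.

```python
from collections import Counter


def label_masks_by_majority(masks, semantic_map):
    """Give each class-agnostic mask the most common class id found inside it.

    masks: list of 2D boolean grids (hypothetical stand-in for SAM2 masks).
    semantic_map: 2D grid of integer class ids (hypothetical stand-in for a
        per-pixel semantic prediction).
    Returns one (class_id, agreement) pair per mask, where agreement is the
    fraction of mask pixels that voted for the winning class. Empty masks
    yield (None, 0.0).
    """
    results = []
    for mask in masks:
        # Count the semantic class of every pixel covered by this mask.
        votes = Counter(
            semantic_map[r][c]
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        )
        if not votes:
            results.append((None, 0.0))
            continue
        class_id, count = votes.most_common(1)[0]
        results.append((class_id, count / sum(votes.values())))
    return results


# Toy 3x4 scene: class 1 on the left, class 2 on the right, with one
# boundary pixel (row 2, column 1) where the semantic model bleeds over.
semantic_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 2, 2, 2],
]
left = [[True, True, False, False] for _ in range(3)]
right = [[False, False, True, True] for _ in range(3)]
labels = label_masks_by_majority([left, right], semantic_map)
```

In this toy case the left mask keeps its sharp boundary and is labeled class 1 even though one of its six pixels was predicted as class 2, which is the kind of boundary cleanup the hybrid approach aims for. The agreement score could be used to flag masks whose label is uncertain.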
