Mesh-based Object Tracking for Dynamic Semantic 3D Scene Graphs via Ray Tracing

Lennart Niecksch, Alexander Mock, Felix Igelbrink, Thomas Wiemann, Joachim Hertzberg

TL;DR

The paper addresses robust dynamic semantic understanding of 3D scenes by tracking known object instances over time and representing them in a mesh-based geometric scene graph. It combines YOLOv8s-based keypoint detection and PnP pose estimation with ray-tracing-based Mesh-ICP, which refines the 6D poses against depth data, and feeds the tracked meshes into SEMAP for spatio-semantic reasoning. Key contributions include a real-time-friendly, mesh-centric scene graph for dynamic objects, pre-segmentation of the sensor data, and a demonstration on a Tiago robot. The approach enables live spatial reasoning and planning support in dynamic environments. The correspondence search has a runtime complexity of $O(n \cdot \log(m))$, and GPU-accelerated backends could reach higher frame rates.

Abstract

In this paper, we present a novel method for 3D geometric scene graph generation using range sensors and RGB cameras. We initially detect instance-wise keypoints with a YOLOv8s model to compute 6D pose estimates of known objects by solving PnP. We use a ray tracing approach to track a geometric scene graph consisting of mesh models of object instances. In contrast to classical point-to-point matching, this leads to more robust results, especially under occlusions between object instances. We show that this hybrid strategy leads to robust self-localization, pre-segmentation of the range sensor data, and accurate pose tracking of objects using the same environmental representation. All detected objects are integrated into a semantic scene graph. This scene graph then serves as a front end to a semantic mapping framework to allow spatial reasoning.
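The ray-tracing correspondence idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`ray_triangle_hits`, `cast`, `kabsch`, `mesh_icp_step`) are hypothetical, rays are intersected with every triangle by brute force instead of using an acceleration structure, and the alignment step is a plain least-squares rigid fit (Kabsch), which is an assumption and may differ from the error metric used in the paper. For each measured depth point, a ray is cast from the sensor along the same direction against the mesh at its current pose estimate; the ray hit and the measured point form a correspondence.

```python
import numpy as np

def ray_triangle_hits(origin, dirs, tri, eps=1e-9):
    """Moeller-Trumbore test: distance along each ray to one triangle (inf if missed)."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(dirs, e2)                      # (n, 3)
    det = p @ e1                                # (n,)
    ok = np.abs(det) > eps                      # skip rays parallel to the triangle
    inv = np.where(ok, 1.0 / np.where(ok, det, 1.0), 0.0)
    s = origin - v0
    u = (p @ s) * inv
    q = np.cross(s, e1)
    v = (dirs @ q) * inv
    t = (e2 @ q) * inv
    # Small tolerance so rays hitting a shared edge are not dropped by rounding.
    hit = ok & (u >= -eps) & (v >= -eps) & (u + v <= 1 + eps) & (t > eps)
    return np.where(hit, t, np.inf)

def cast(origin, dirs, vertices, faces):
    """Closest hit point per ray over all triangles, plus a validity mask."""
    t = np.full(len(dirs), np.inf)
    for f in faces:
        t = np.minimum(t, ray_triangle_hits(origin, dirs, vertices[f]))
    valid = np.isfinite(t)
    return origin + dirs * np.where(valid, t, 0.0)[:, None], valid

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src_i + t ~ dst_i."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs

def mesh_icp_step(points, vertices, faces, R, t, origin=np.zeros(3)):
    """One refinement step of the object pose (R, t) against measured points."""
    rays = points - origin
    dirs = rays / np.linalg.norm(rays, axis=1)[:, None]
    hits, valid = cast(origin, dirs, vertices @ R.T + t, faces)
    dR, dt = kabsch(hits[valid], points[valid])
    return dR @ R, dR @ t + dt                  # compose the correction with the old pose
```

Calling `mesh_icp_step` repeatedly, starting from an initial pose such as one obtained from PnP, moves the mesh toward the measurements. Rays that miss the mesh are masked out rather than matched to a far-away surface point.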

Paper Structure (9 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Our test platform Tiago inspecting a table top scene containing multiple objects.
  • Figure 2: (a) The scene and the detected keypoints and boxes. (b) The initial pose estimates (green) and the ray tracing correspondences (purple). (c) The refined poses of the object instances (blue).
  • Figure 3: By tracking the geometric scene graph, we can inversely determine all the points that are unknown and use them as a pre-segmentation for other methods. The left image shows the Tiago robot operating inside a real environment; the right image shows the matched scene graph, which is composed only of walls and doors. The points that are not considered for tracking (red) give a pre-segmentation of the sensor data that can help subsequent object detectors produce more reliable results.
  • Figure 4: Results of the SEMAP queries "on" (red) and "left-of" (cyan) using the refined object poses. Best viewed in color. The red box shows the top projection abstraction of the bottom power drill, and the cyan box shows the left projection abstraction of the multimeter.
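The pre-segmentation described for Figure 3 can be sketched as a simple range comparison. This is an illustrative assumption, not the authors' code: the function name `presegment` and the tolerance `tol` (0.05, in the same unit as the ranges) are hypothetical, and the simulated ranges are assumed to come from ray casting the tracked scene-graph meshes along the sensor's rays, with `inf` for rays that hit nothing.

```python
import numpy as np

def presegment(measured_ranges, simulated_ranges, tol=0.05):
    """Return True for measurements NOT explained by the tracked meshes.

    measured_ranges:  ranges reported by the sensor, one per ray
    simulated_ranges: ranges from ray casting the scene-graph meshes
                      along the same rays (inf where the ray hits nothing)
    tol:              hypothetical agreement threshold
    """
    measured = np.asarray(measured_ranges, dtype=float)
    simulated = np.asarray(simulated_ranges, dtype=float)
    explained = np.isfinite(simulated) & (np.abs(measured - simulated) <= tol)
    return ~explained
```

Points flagged `True` correspond to the red points in the figure: they belong to nothing in the tracked scene graph and could be handed to a subsequent object detector.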