SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments

Yinqi Chen, Meiying Zhang, Qi Hao, Guang Zhou

TL;DR

SemanticFlow tackles dynamic scene understanding by jointly predicting 3D scene flow and instance segmentation from consecutive point clouds in a self-supervised, multi-task framework. It introduces a coarse-to-fine strategy with a shared backbone and a suite of interdependent losses to enforce motion-semantics consistency, reinforced by a self-supervised pseudo-labeling pipeline. Empirical results on Waymo and Argoverse 2 show competitive scene flow accuracy and enhanced segmentation metrics, with notable robustness under limited labeling. The approach demonstrates practical impact for downstream autonomous driving tasks such as SLAM, obstacle avoidance, and planning, while reducing annotation reliance.

Abstract

Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat these as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine, prediction-based multi-task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects provides contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions that enhance the performance of scene flow estimation and instance segmentation while helping ensure the spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which uses coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.
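The third contribution, computing a rigid object's transformation between sequential frames and turning it into self-supervised labels, can be illustrated with a minimal sketch. The abstract does not specify how the transformation is estimated, so the code below assumes a standard least-squares (Kabsch/SVD) fit and assumes that point correspondences for one rigid cluster are already known; the function names `estimate_rigid_transform` and `rigid_pseudo_flow` are hypothetical and are not taken from the paper.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): find R, t with dst ~= src @ R.T + t.

    Assumption: src[i] and dst[i] are corresponding points of one rigid
    object in two consecutive frames, both of shape (N, 3).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def rigid_pseudo_flow(points, R, t):
    """Per-point flow implied by a rigid motion: (R p + t) - p."""
    points = np.asarray(points, dtype=float)
    return points @ R.T + t - points


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(size=(50, 3))  # synthetic rigid cluster
    angle = 0.1  # rotation about the z-axis, in radians
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([1.0, 0.2, 0.0])
    moved = cluster @ R_true.T + t_true

    R, t = estimate_rigid_transform(cluster, moved)
    flow = rigid_pseudo_flow(cluster, R, t)
    print(np.allclose(flow, moved - cluster))
```

In this sketch the flow vectors derived from the fitted transform play the role of pseudo-labels for the points of one rigid object. In practice, correspondences between frames are not given and would have to come from some matching or registration step, which is outside what the abstract describes.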

Paper Structure

This paper contains 28 sections, 11 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: An illustration of SemanticFlow, which estimates scene flow and performs instance segmentation from two consecutive point clouds, enabling applications in SLAM [bahraini2018slam], obstacle detection, decision planning, and tracking.
  • Figure 2: An illustration of the SemanticFlow system diagram.
  • Figure 3: An illustration of $\mathcal{L}_{\text{Rigid}}$. The operators $\odot$, $\oplus$, and $\ominus$ denote element-wise multiplication, addition, and subtraction, respectively, and $\otimes$ denotes matrix multiplication.
  • Figure 4: Comparison of different clustering methods for scene flow segmentation. False negatives (FN) are marked in red, and false positives (FP) are marked in blue.