Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes

Stamatios Georgoulis, Weining Ren, Alfredo Bochicchio, Daniel Eckert, Yuanyou Li, Abel Gawel

TL;DR

This paper addresses motion segmentation in dynamic, real-world scenes for mobile sensing, where RGB-based methods are often under-constrained and fail on unknown object categories. It proposes a divide-and-conquer, event-based approach that first ego-motion-compensates events using predicted monocular depth and 6DoF camera pose, estimates optical flow in parallel, and then feeds both into a segmentation network whose transformer-based temporal attention module produces temporally consistent masks. Key contributions are (i) ego-motion compensation of event data, (ii) parallel optical flow cues, (iii) a transformer-based temporal attention module, and (iv) a new outdoor benchmark, DSEC-MOTS. The method sets a new state of the art on EV-IMO (+2.19 moving object IoU) and improves moving object IoU on DSEC-MOTS by 12.91. The results indicate that combining ego-motion compensation, optical flow, and temporal attention yields robust, class-agnostic motion segmentation in complex outdoor scenes, which is relevant to autonomous navigation and robotics.

Abstract

Rapid and reliable identification of dynamic scene parts, also known as motion segmentation, is a key challenge for mobile sensors. Contemporary RGB camera-based methods rely on modeling camera and scene properties; however, they are often under-constrained and fall short on unknown categories. Event cameras have the potential to overcome these limitations, but corresponding methods have only been demonstrated in smaller-scale indoor environments with simplified dynamic objects. This work presents an event-based method for class-agnostic motion segmentation that can also be deployed successfully across complex large-scale outdoor environments. To this end, we introduce a novel divide-and-conquer pipeline that combines: (a) ego-motion compensated events, computed via a scene understanding module that predicts monocular depth and camera pose as auxiliary tasks, and (b) optical flow from a dedicated optical flow module. These intermediate representations are then fed into a segmentation module that predicts motion segmentation masks. A novel transformer-based temporal attention module in the segmentation module builds correlations across adjacent 'frames' to obtain temporally consistent segmentation masks. Our method sets a new state of the art on the classic EV-IMO benchmark (indoors), where we achieve improvements of 2.19 moving object IoU (2.22 mIoU) and 4.52 point IoU, as well as on a newly generated motion segmentation and tracking benchmark (outdoors) based on the DSEC event dataset, termed DSEC-MOTS, where we show an improvement of 12.91 moving object IoU.
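
For readers unfamiliar with the metrics quoted above, intersection over union (IoU) on a binary motion mask can be sketched as follows. This is a generic illustration, not the benchmarks' evaluation code: the function names `binary_iou` and `mean_iou`, the list-of-lists mask format, and the convention that mIoU averages the moving and static class IoUs are all assumptions made here.

```python
def binary_iou(pred, target, positive=1):
    """IoU of the `positive` class between two equally sized 2D masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_hit = p == positive
            t_hit = t == positive
            intersection += p_hit and t_hit  # pixel labelled positive in both
            union += p_hit or t_hit          # pixel labelled positive in either
    # Assumed convention: if the class is absent from both masks, count it as
    # perfect agreement instead of dividing by zero.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pred, target, classes=(0, 1)):
    """Average of per-class IoUs (assumed here: static = 0, moving = 1)."""
    return sum(binary_iou(pred, target, c) for c in classes) / len(classes)


# Toy 2x3 masks: 1 marks a moving-object pixel, 0 a static pixel.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]

moving_iou = binary_iou(pred, gt, positive=1)  # 2 shared pixels / 4 in union
miou = mean_iou(pred, gt)
```

The reported gains (for example, 2.19 moving object IoU) are differences between such scores expressed in percentage points.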

Paper Structure (21 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: In this paper, we generalize event-based motion segmentation to large-scale outdoor scenes. To realize this, we propose two key features: (a) Motion Compensation. Unlike current works [zhang2023multi, mitrokhin2019ev] that use the raw event representation (e.g. voxel grid) as input to motion segmentation and let the network figure out both ego-motion and dynamic object motion at once, we argue that ego-motion compensating the event representation (by predicting depth and 6DoF pose) is a necessary pre-processing step for motion segmentation, as it makes static regions sharper (green box) while leaving dynamic regions blurry (red box); (b) Temporal Attention. Because events are inherently noisy and jittery, and can disappear and re-appear between adjacent time steps (red box), it is crucial to incorporate temporal consistency modules into motion segmentation. Together, these features greatly boost the system's overall performance.
  • Figure 2: System overview. Our pipeline adopts a divide-and-conquer approach that operates in three steps, namely ego-motion compensation, optical flow estimation, and motion segmentation. First, ego-motion compensated events (backward and forward) are computed by warping the input event representation using the predicted depth maps and 6DoF camera pose. Second, optical flow (backward and forward) is estimated from the input event representation in parallel. Third, the ego-motion compensated events are concatenated with the optical flow and fed as input to the motion segmentation network, which predicts the motion segmentation masks (backward and forward). A Temporal Attention Module that applies channel and spatial attention across the hidden states of different timestamps ($t$, $t-1$) is employed inside the motion segmentation network to obtain temporally consistent motion masks.
  • Figure 3: Temporal Attention Module (TAM). The TAM applies channel and spatial attention across the hidden states of different timestamps ($t$, $t-1$) inside the motion segmentation network to obtain temporally consistent motion masks.
  • Figure 4: Comparative qualitative analysis of the baseline model and variations of our model. From left to right: ego-motion compensated events, RGB image (only for reference), segmentation mask of EV-IMO [mitrokhin2019ev] with an ECN backbone [ye2020unsupervised], our segmentation mask (only ego-motion compensation), our segmentation mask (ego-motion compensation plus optical flow), our segmentation mask (full model), and ground truth motion segmentation mask.
  • Figure 5: Qualitative comparison between our algorithm and 0-MMS [parameshwara20210] on the EV-IMO benchmark. From top to bottom: RGB image, 0-MMS event segmentation, our event segmentation, and our segmentation mask. The first two rows are taken from the 0-MMS paper [parameshwara20210].
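
The warping step described in the Figure 2 caption can be illustrated with a minimal per-event sketch. It is not the authors' implementation: it assumes a pinhole camera, a known depth per event, and a first-order (small-motion) rigid model in which `omega` and `trans` give the apparent rotation and translation of static scene points in the camera frame over the whole event window. All names (`warp_event`, `compensate`, `depth_at`) are hypothetical.

```python
def warp_event(x, y, t, depth, intrinsics, omega, trans, t_ref, window):
    """Move one event pixel to where it would appear at time t_ref."""
    fx, fy, cx, cy = intrinsics
    # Fraction of the window's camera motion between the event and t_ref.
    alpha = (t_ref - t) / window
    # Back-project the pixel to a 3D point using its depth.
    X = (x - cx) / fx * depth
    Y = (y - cy) / fy * depth
    Z = depth
    # First-order rigid motion (assumption): P' = P + alpha * (omega x P + trans).
    wx, wy, wz = omega
    tx, ty, tz = trans
    Xn = X + alpha * (wy * Z - wz * Y + tx)
    Yn = Y + alpha * (wz * X - wx * Z + ty)
    Zn = Z + alpha * (wx * Y - wy * X + tz)
    # Re-project the moved point onto the image plane.
    return fx * Xn / Zn + cx, fy * Yn / Zn + cy


def compensate(events, depth_at, intrinsics, omega, trans, t_ref, window):
    """Warp every (x, y, t) event; depth_at(x, y) is a hypothetical depth lookup."""
    return [
        warp_event(x, y, t, depth_at(x, y), intrinsics, omega, trans, t_ref, window)
        for (x, y, t) in events
    ]


# Toy check: one static point at depth 2 seen at two times while the camera
# translates sideways. Its events fire at different pixels (x = 45 and 47.5).
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy
events = [(45.0, 50.0, 0.0), (47.5, 50.0, 0.5)]
warped = compensate(
    events,
    depth_at=lambda x, y: 2.0,  # constant depth for the toy scene
    intrinsics=intrinsics,
    omega=(0.0, 0.0, 0.0),
    trans=(0.1, 0.0, 0.0),
    t_ref=1.0,
    window=1.0,
)
# Both warped events land near pixel (50, 50).
```

In this toy case the two events of the static point collapse onto the same pixel after warping, which is the sharpening effect Figure 1 describes for static regions. Events from an independently moving object would not follow the camera-induced motion and would stay spread out.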