Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes
Stamatios Georgoulis, Weining Ren, Alfredo Bochicchio, Daniel Eckert, Yuanyou Li, Abel Gawel
TL;DR
This paper addresses motion segmentation in dynamic, real-world scenes for mobile sensing, where RGB-based methods are often under-constrained and fall short on unknown object categories. It proposes a divide-and-conquer, event-based approach that first compensates events for ego-motion using monocular depth and 6DoF pose estimates, supplements them with dense optical flow, and then applies a segmentation model with transformer-based temporal attention to produce temporally consistent masks. Key contributions include (i) ego-motion compensation of event data, (ii) parallel optical-flow cues, (iii) a transformer-based temporal attention module, and (iv) a new outdoor benchmark, DSEC-MOTS. The method achieves state-of-the-art results on EV-IMO and improves moving object IoU by 12.91 on DSEC-MOTS. The findings indicate that combining ego-motion compensation, optical flow, and temporal attention yields robust, class-agnostic motion segmentation in complex outdoor scenes, with practical implications for autonomous navigation and robotics.
Abstract
Rapid and reliable identification of dynamic scene parts, also known as motion segmentation, is a key challenge for mobile sensors. Contemporary RGB camera-based methods rely on modeling camera and scene properties; however, they are often under-constrained and fall short on unknown categories. Event cameras have the potential to overcome these limitations, but corresponding methods have only been demonstrated in smaller-scale indoor environments with simplified dynamic objects. This work presents an event-based method for class-agnostic motion segmentation that can also be deployed successfully in complex, large-scale outdoor environments. To this end, we introduce a novel divide-and-conquer pipeline that combines: (a) ego-motion compensated events, computed via a scene understanding module that predicts monocular depth and camera pose as auxiliary tasks, and (b) optical flow from a dedicated optical flow module. These intermediate representations are then fed into a segmentation module that predicts motion segmentation masks. A novel transformer-based temporal attention module in the segmentation module builds correlations across adjacent 'frames' to obtain temporally consistent segmentation masks. Our method sets a new state of the art on the classic EV-IMO benchmark (indoors), where we achieve improvements of 2.19 moving object IoU (2.22 mIoU) and 4.52 point IoU, as well as on a newly generated motion segmentation and tracking benchmark (outdoors) based on the DSEC event dataset, termed DSEC-MOTS, where we show an improvement of 12.91 moving object IoU.
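To make the ego-motion compensation step more concrete, the following is a minimal Python sketch of the underlying geometry, not the authors' implementation. It assumes a pinhole camera model, a known depth for each event, and a single rigid transform (rotation `R`, translation `t`) from the event's camera frame to a reference frame. The function name `warp_event` and all of its parameters are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def warp_event(
    x: float, y: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]], t: Sequence[float],
) -> Tuple[float, float]:
    """Warp one event pixel into a reference camera frame (illustrative only).

    Assumptions: pinhole intrinsics (fx, fy, cx, cy), depth measured along
    the optical axis, and (R, t) mapping points from the event's camera
    frame to the reference camera frame.
    """
    # Back-project the pixel to a 3D point in the event's camera frame.
    X = (x - cx) / fx * depth
    Y = (y - cy) / fy * depth
    Z = depth

    # Apply the rigid camera motion: p_ref = R @ p + t.
    Xr = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
    Yr = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
    Zr = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]

    # Project the transformed point back onto the image plane.
    return (fx * Xr / Zr + cx, fy * Yr / Zr + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Hypothetical camera: focal length 100 px, principal point (50, 50).
    print(warp_event(50.0, 50.0, 2.0, 100.0, 100.0, 50.0, 50.0,
                     identity, [0.1, 0.0, 0.0]))
```

Under these assumptions, events triggered by static scene points land where the camera motion predicts, whereas events from independently moving objects keep a residual displacement. That residual is the kind of cue a downstream segmentation module can exploit.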
