Event-Free Moving Object Segmentation from Moving Ego Vehicle

Zhuyun Zhou, Zongwei Wu, Danda Pani Paudel, Rémi Boutteau, Fan Yang, Luc Van Gool, Radu Timofte, Dominique Ginhac

TL;DR

This work tackles moving object segmentation from moving ego vehicles by leveraging event cameras to extract motion cues unreachable by frame-based methods. It introduces EmoFormer, which learns from event-derived priors during training while operating in an event-free manner at inference, and fuses low-rank RGB and motion priors with semantic guidance to isolate moving objects. A new large-scale dataset, DSEC-MOS, provides dense pixel-level MOS annotations for urban driving scenes and serves as a benchmark against state-of-the-art AVOS methods. The results show significant performance gains, highlighting the practical value of event-based supervision for robust MOS in dynamic driving scenarios.

Abstract

Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving, especially for sequences obtained from moving ego vehicles. Most segmentation methods leverage motion cues obtained from optical flow maps. However, because these optical flows are usually pre-computed from successive RGB frames, such methods ignore the events that occur between frames, which limits their ability to discern objects that appear relatively static but are genuinely in motion. To address these limitations, we propose to exploit event cameras, which provide rich motion cues without relying on optical flow, for better video understanding. To foster research in this area, we first introduce DSEC-MOS, a novel large-scale dataset for moving object segmentation from moving ego vehicles and the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on our dataset. Subsequently, we devise EmoFormer, a novel network able to exploit the event data. For this purpose, we fuse the event temporal prior with spatial semantic maps to distinguish genuinely moving objects from the static background, adding another level of dense supervision around our object of interest. Our proposed network uses event data only during training and does not require event input during inference, making it directly comparable to frame-only methods in terms of efficiency and more widely usable in many application cases. The exhaustive comparison highlights a significant performance improvement of our method over all other methods. The source code and dataset are publicly available at: https://github.com/ZZY-Zhou/DSEC-MOS.

Paper Structure (15 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: DSEC-MOS Examples and Visual Comparison. The top part (a) shows RGB frames calibrated to the event camera, with our DSEC-MOS ground-truth segmentation masks visualized on them. The bottom part (b, c) shows that our dataset provides per-frame annotation and distinguishes motion attributes, neither of which is available in the previous dataset (xia2023cmda). Best viewed zoomed in.
  • Figure 2: Architecture. In addition to the standard RGB Encoder-Decoder architecture, we introduce an auxiliary branch dedicated to harnessing the motion insights derived from the recorded event data (Sec. \ref{generation}). This learned representation is subsequently merged into the main processing pipeline, thereby enhancing feature modeling (Sec. \ref{fusion}). To further refine event-based learning and the understanding of object dynamics, we employ semantic maps to convert global scene motion into the motion of the target objects (Sec. \ref{superv}). This filtering strategy leads to a tightly coupled semantic-guided event awareness, ultimately shaping our joint learning scheme.
  • Figure 3: Qualitative Comparison. Our generated masks are closer to the Ground Truth (GT) than those of competing methods. Please zoom in for details.
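
The semantic-guided filtering described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the function name `semantic_guided_motion_mask`, the class ids, and the `min_events` threshold are all hypothetical, and the learned event prior is replaced here by a plain per-pixel event count. The only idea it shows is that a scene-wide motion signal is kept where the semantic map marks an object class of interest and discarded elsewhere.

```python
from typing import List, Set

Grid = List[List[int]]


def semantic_guided_motion_mask(
    event_counts: Grid,
    semantic_labels: Grid,
    movable_classes: Set[int],
    min_events: int = 1,
) -> Grid:
    """Keep event activity only on pixels whose semantic class is of interest.

    Hypothetical helper for illustration: a pixel is marked 1 when it has at
    least `min_events` events AND its semantic label is in `movable_classes`.
    """
    if len(event_counts) != len(semantic_labels):
        raise ValueError("event and semantic maps must have the same height")

    mask: Grid = []
    for ev_row, sem_row in zip(event_counts, semantic_labels):
        if len(ev_row) != len(sem_row):
            raise ValueError("event and semantic maps must have the same width")
        mask.append([
            1 if ev >= min_events and cls in movable_classes else 0
            for ev, cls in zip(ev_row, sem_row)
        ])
    return mask


# Toy 2x3 scene. Assumed class ids: 0 = background, 1 = car, 2 = pedestrian.
event_counts = [[0, 3, 4],
                [5, 0, 2]]
semantic_labels = [[0, 1, 1],
                   [0, 1, 2]]

mask = semantic_guided_motion_mask(event_counts, semantic_labels, {1, 2})
print(mask)
```

In this toy input, the background pixel with 5 events is dropped because its class is not of interest, and the car pixel with no events is dropped because it shows no activity. With a moving ego vehicle, static objects also trigger events, so a fixed threshold like this would not separate genuinely moving objects on real data; the sketch only conveys the filtering idea, not the learned supervision.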