Event-Free Moving Object Segmentation from Moving Ego Vehicle
Zhuyun Zhou, Zongwei Wu, Danda Pani Paudel, Rémi Boutteau, Fan Yang, Luc Van Gool, Radu Timofte, Dominique Ginhac
TL;DR
This work tackles moving object segmentation (MOS) from moving ego vehicles by using event cameras to capture motion cues that frame-based methods miss. It introduces EmoFormer, which learns from event-derived priors during training but runs event-free at inference, and fuses low-rank RGB and motion priors with semantic guidance to isolate moving objects. A new large-scale dataset, DSEC-MOS, provides dense pixel-level MOS annotations for urban driving scenes and is used to benchmark state-of-the-art AVOS methods. The results show significant performance gains over these methods, highlighting the practical value of event-based supervision for robust MOS in dynamic driving scenarios.
Abstract
Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving, especially for sequences obtained from moving ego vehicles. Most segmentation methods leverage motion cues obtained from optical flow maps. However, because these flows are typically pre-computed from successive RGB frames, such methods ignore the events that occur between frames, which limits their ability to detect objects that appear relatively static but are genuinely in motion. To address this limitation, we propose to exploit event cameras, which provide rich motion cues without relying on optical flow, for better video understanding. To foster research in this area, we first introduce DSEC-MOS, a novel large-scale dataset for moving object segmentation from moving ego vehicles and the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on our dataset. We then devise EmoFormer, a novel network able to exploit the event data. To this end, we fuse the event temporal prior with spatial semantic maps to distinguish genuinely moving objects from the static background, adding another level of dense supervision around the objects of interest. Our proposed network relies on event data only during training and requires no event input at inference, making it directly comparable to frame-only methods in terms of efficiency and usable in a wider range of applications. An extensive comparison shows that our method significantly outperforms all compared methods. The source code and dataset are publicly available at: https://github.com/ZZY-Zhou/DSEC-MOS.
