Instance-Level Moving Object Segmentation from a Single Image with Events
Zhexiong Wan, Bin Fan, Le Hui, Yuchao Dai, Gim Hee Lee
TL;DR
This work introduces InsMOS, the first instance-level moving object segmentation framework that fuses a single image with asynchronous event data to segment multiple independently moving objects under ego-motion. The method employs cross-modal masked attention (CMA) to merge texture from images with motion cues from events, augmented by explicit contrastive feature learning (CFL) and a flow-guided feature enhancement (FFE) module to reinforce motion representations. It decouples mask generation from motion classification, enabling varying object counts, and uses a Hungarian assignment for alignment with ground-truth masks. Extensive experiments on EVIMO and EKubric demonstrate strong performance gains over unimodal methods, with real-time efficiency and robust handling of complex dynamics, including camera motion and dense object configurations. The results highlight the practical potential of combining image and event data for dense, instance-level motion segmentation and point to future directions in multimodal event-based perception.
Abstract
Moving object segmentation plays a crucial role in understanding dynamic scenes that contain multiple moving objects; the difficulty lies in accounting for both spatial texture structures and temporal motion cues. Existing methods based on video frames struggle to distinguish whether the pixel displacements of an object are caused by camera motion or by object motion, because accurate image-based motion modeling is complex. Recent advances exploit the motion sensitivity of event cameras to compensate for the limited motion modeling capability of conventional images, but they in turn struggle to segment pixel-level object masks because events lack dense texture structures. To address these two limitations of unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. We conduct extensive evaluations on multiple datasets, ablation experiments with different input settings, and a real-time efficiency analysis of the proposed framework. We believe that this first attempt to combine image and event data for practical deployment can provide new insights for future event-based motion research. The source code for model training and the pre-trained weights are released at https://npucvr.github.io/EvInsMOS
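The TL;DR mentions a Hungarian assignment that aligns predicted masks with ground-truth masks. The following is a minimal sketch of that matching step, not the authors' implementation. It assumes a Dice-based matching cost and binary masks flattened to 0/1 lists, and it replaces the Hungarian algorithm with an exhaustive search over assignments, which returns the same optimal matching for small inputs. The names dice_cost and match_masks are hypothetical.

```python
from itertools import permutations


def dice_cost(pred, gt):
    """Return 1 - Dice overlap for two binary masks given as flat 0/1 lists."""
    inter = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    if total == 0:
        return 0.0  # both masks empty: treat as a perfect match
    return 1.0 - 2.0 * inter / total


def match_masks(pred_masks, gt_masks):
    """Find the one-to-one assignment of predictions to ground-truth masks
    with minimum total cost.

    Assumption: there are at least as many predictions as ground-truth
    masks, so some predictions may stay unmatched.
    Returns (pairs, total_cost), where pairs holds (pred_idx, gt_idx).
    """
    n_pred, n_gt = len(pred_masks), len(gt_masks)
    if n_pred < n_gt:
        raise ValueError("need at least as many predictions as ground truths")

    # cost[i][j]: cost of assigning prediction i to ground-truth mask j
    cost = [[dice_cost(p, g) for g in gt_masks] for p in pred_masks]

    best_pairs, best_cost = [], float("inf")
    # Exhaustive search (stand-in for the Hungarian algorithm):
    # perm[j] is the prediction index assigned to ground-truth mask j.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_cost:
            best_cost = total
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs, best_cost


# Three predicted masks (the third is empty) and two ground-truth masks.
pred_masks = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
gt_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
pairs, total_cost = match_masks(pred_masks, gt_masks)
print(sorted(pairs), total_cost)
```

Exhaustive search grows factorially with the number of masks, so it is only suitable for illustration. In practice, scipy.optimize.linear_sum_assignment solves the same assignment problem in polynomial time on a rectangular cost matrix.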
