Instance-Level Moving Object Segmentation from a Single Image with Events

Zhexiong Wan, Bin Fan, Le Hui, Yuchao Dai, Gim Hee Lee

TL;DR

This work introduces InsMOS, the first instance-level moving object segmentation framework that fuses a single image with asynchronous event data to segment multiple independently moving objects under ego-motion. The method employs cross-modal masked attention (CMA) to merge texture from images with motion cues from events, augmented by explicit contrastive feature learning (CFL) and a flow-guided feature enhancement (FFE) module to reinforce motion representations. It decouples mask generation from motion classification, enabling varying object counts, and uses a Hungarian assignment for alignment with ground-truth masks. Extensive experiments on EVIMO and EKubric demonstrate strong performance gains over unimodal methods, with real-time efficiency and robust handling of complex dynamics, including camera motion and dense object configurations. The results highlight the practical potential of combining image and event data for dense, instance-level motion segmentation and point to future directions in multimodal event-based perception.
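
The one-to-one alignment between predicted masks and ground-truth masks mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the cost is taken to be 1 - IoU (the summary does not state the matching cost), the masks are flat 0/1 lists, and the names `mask_iou` and `match_predictions` are hypothetical. For readability the optimal assignment is found by brute force over permutations, which yields the same result a Hungarian solver would on tiny inputs but is not the Hungarian algorithm itself.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU between two binary masks given as flat lists of 0/1."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def match_predictions(pred_masks, gt_masks):
    """Return sorted (pred_index, gt_index) pairs minimising total (1 - IoU).

    Brute force over permutations: assumes len(pred_masks) >= len(gt_masks)
    and only a handful of masks. A practical implementation would call a
    polynomial-time Hungarian solver instead.
    """
    cost = [[1.0 - mask_iou(p, g) for g in gt_masks] for p in pred_masks]
    best_pairs, best_cost = [], float("inf")
    for perm in permutations(range(len(pred_masks)), len(gt_masks)):
        # perm[g] is the prediction assigned to ground-truth mask g.
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_cost:
            best_cost = total
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return sorted(best_pairs)


pred = [
    [0, 0, 1, 1, 0, 0],  # query 0: overlaps ground truth 1
    [1, 1, 0, 0, 0, 0],  # query 1: identical to ground truth 0
    [0, 0, 0, 0, 0, 1],  # query 2: overlaps nothing, stays unmatched
]
gt = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
]
pairs = match_predictions(pred, gt)
print(pairs)
```

Because there are more queries than ground-truth objects, one query is left unmatched; in set-prediction models such unmatched queries are typically supervised as "no object", which is one way a fixed query budget can cover a varying number of objects.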

Abstract

Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects; the difficulty lies in accounting for both spatial texture structures and temporal motion cues. Existing methods based on video frames struggle to distinguish whether an object's pixel displacements are caused by camera motion or object motion, owing to the complexity of accurate image-based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to compensate for the inadequate motion modeling capabilities of conventional images, but in turn struggle to segment pixel-level object masks because events lack dense texture structures. To address these two limitations imposed by unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Through extensive evaluations on multiple datasets, ablation experiments with different input settings, and a real-time efficiency analysis of the proposed framework, we believe that our first attempt to incorporate image and event data for practical deployment can provide new insights for future event-based motion research. The source code with model training and pre-trained weights is released at https://npucvr.github.io/EvInsMOS
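
To make the idea of separating mask segmentation from motion classification concrete, the following is a minimal sketch rather than the released implementation. It assumes each query produces a list of per-pixel mask logits and an independent motion probability, and that a query is kept when its motion probability reaches a threshold. The function name `select_moving_instances`, the data layout, and the threshold of 0.5 are all hypothetical.

```python
def select_moving_instances(mask_logits, motion_scores, score_thresh=0.5):
    """Keep binary masks only for queries classified as moving.

    mask_logits: one flat list of per-pixel logits per query (assumed layout).
    motion_scores: one probability in [0, 1] per query (assumed output of a
        separate motion classification head).
    score_thresh: hypothetical cut-off, not a value taken from the paper.
    """
    kept = []
    for logits, score in zip(mask_logits, motion_scores):
        if score >= score_thresh:
            # A positive logit corresponds to a mask probability above 0.5.
            kept.append([1 if v > 0.0 else 0 for v in logits])
    return kept


mask_logits = [
    [2.1, 1.3, -0.7, -3.0],   # query 0
    [-1.0, 0.4, 0.9, -2.2],   # query 1
    [-0.5, -0.1, 1.8, 2.5],   # query 2
]
motion_scores = [0.9, 0.2, 0.7]  # query 1 is judged static and dropped

moving = select_moving_instances(mask_logits, motion_scores)
print(len(moving), moving)
```

Since the number of returned masks depends only on how many queries pass the threshold, the same fixed set of queries can yield zero, one, or several moving instances per scene.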

Paper Structure

This paper contains 16 sections, 12 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: A challenging example of MOS with multiple independently moving objects (IMOs). When multiple static and moving objects coexist in the view of a moving camera, there are two major challenges: 1) static objects projected onto 2D tend to shift relative to the background due to depth parallax, and 2) static objects still present pixel displacements in the image view and also trigger events in the event camera view. These effects make it difficult to distinguish the objects that are actually moving in such complex dynamic scenes. We compare IDOL (Wu et al., ECCV 2022) and ZBS (An et al., CVPR 2023), which use only image inputs, with our two ablation models that use events only and two images as inputs, respectively. The events are visualized with red indicating a brightness increase and blue indicating a decrease. The comparisons show that using only images as input is susceptible to interference from camera motion, which can lead to static objects being misjudged as moving. Additionally, due to the lack of dense texture information in events, nearby moving objects may be misidentified as a single object. In contrast, our model effectively integrates the dense texture from image data and the rich motion from event data for accurate segmentation of IMOs.
  • Figure 2: Our proposed InsMOS framework combines a single image with its subsequent events. The network pipeline is divided into three parts: 1) The cross-modal masked attention augmentation (CMA) module interactively augments texture and motion representations with an additional contrastive feature learning mechanism applied in training. 2) Masks and motion embeddings are decoded separately, allowing for thresholding instance-level segmentation outputs. Since the training loss is applied on full embeddings, this thresholding step only needs to be performed during inference. 3) The flow-guided motion feature enhancement module is designed to enhance motion feature learning during training.
  • Figure 3: Diagram of our multi-frame contrastive feature learning, which consists of two parts. The first maximizes the feature consistency across frames (green arrows) within the same batch and modality, keeping features of the same modality consistent. The second minimizes the self-similarity and cross-similarity across batches (blue and orange arrows) to learn complementary information across modalities. The bottom-left upward curved arrow indicates maximizing the consistency measurement to reduce the complementary information between identical features, while the bottom-right downward curved arrow indicates minimizing the similarity measurements to increase the complementary information. Note that only similarities between adjacent batches and frames are drawn for brevity; in practice, every batch and every frame is considered.
  • Figure 4: Visual comparisons on the real EVIMO (Mitrokhin et al., IROS 2019) dataset. Our method can segment all IMOs more accurately, especially when they are close together.
  • Figure 5: Visual comparisons on the simulated EKubric (Wan et al., ICCV 2023) dataset. We compare two image-based methods, IDOL and ZBS, in the upper two samples, and two ablation models, replacing the input data with only events or images, in the bottom two samples.
  • ...and 3 more figures
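
The two-part objective described in the Figure 3 caption can be sketched as follows. This is one possible reading, not the paper's loss: it assumes cosine similarity as the measurement, treats "self-similarity" as same-modality pairs from different batch samples and "cross-similarity" as image-event pairs from different batch samples, and combines the two terms with equal weight. The names `cosine`, `contrastive_terms`, and the nested-list feature layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0


def contrastive_terms(img, evt):
    """img[b][t] and evt[b][t]: feature vector of batch sample b at frame t.

    Returns (consistency, similarity). The first is to be maximised, the
    second minimised, mirroring the two arrow groups in the caption.
    """
    consistency, similarity = [], []
    # Same sample, same modality, different frames: features should agree.
    for feats in (img, evt):
        for frames in feats:
            for t in range(len(frames)):
                for s in range(t + 1, len(frames)):
                    consistency.append(cosine(frames[t], frames[s]))
    # Different samples, same or different modality: features should differ.
    for b in range(len(img)):
        for c in range(len(img)):
            if b == c:
                continue
            for u in img[b] + evt[b]:
                for v in img[c] + evt[c]:
                    similarity.append(cosine(u, v))
    return mean(consistency), mean(similarity)


# Toy features: identical across frames, orthogonal across samples and modalities.
img = [[[1, 0, 0, 0], [1, 0, 0, 0]], [[0, 1, 0, 0], [0, 1, 0, 0]]]
evt = [[[0, 0, 1, 0], [0, 0, 1, 0]], [[0, 0, 0, 1], [0, 0, 0, 1]]]

consistency, similarity = contrastive_terms(img, evt)
loss = (1.0 - consistency) + similarity  # equal weighting is an assumption
print(consistency, similarity, loss)
```

In this toy case the features are already perfectly consistent across frames and fully dissimilar across samples, so the combined loss is zero; any drift between frames or overlap between samples would raise it.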