MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice

Friedhelm Hamann, Hanxiong Li, Paul Mieske, Lars Lewejohann, Guillermo Gallego

TL;DR

MouseSIS introduces space-time instance segmentation (SIS) for event-based data by providing the first public dataset with pixel-accurate masks for up to seven mice, using aligned grayscale frames and events captured via a beamsplitter system. The work presents two baseline approaches—ModelMixSort (tracking-by-detection) and EventSeqFormer (tracking-by-query transformer)—to benchmark SIS with both modalities and their combination. Experimental results show that incorporating event data can improve tracking performance, though challenges remain in low-contrast, high-noise sequences and in integrating modalities end-to-end. The dataset (33 sequences, ~640 seconds total, ~75,000 masks) offers a valuable resource for developing robust, high-time-resolution tracking under difficult conditions and encourages broader application in biology and neuroscience. Overall, MouseSIS advances event-based scene understanding by enabling fine-grained, mask-level tracking across time, informing future development of space-time tracking algorithms.

Abstract

Enabled by large annotated datasets, tracking and segmentation of objects in videos have made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the input consists of quasi-continuous events and, optionally, aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. Project repository: https://github.com/tub-rip/MouseSIS

Paper Structure (23 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The dataset contains high-definition (a) frames, (b) events, and (c) instance masks, which are consistent throughout a video. (d) As the overlay of them shows, events and frames are pixel-level aligned. Consequently, the masks are valid for both modalities.
  • Figure 2: (a) Our recording setup consists of several hardware-synchronized cameras. Images from the top view were used for the MouseSIS dataset. (b) The beamsplitter system for spatial alignment of frames and events.
  • Figure 3: Method 1 overview (ModelMixSort). Our method uses a tracking-by-detection approach. (a) For each frame and the corresponding events, we extract two sets of boxes, which are used as box prompts for SAM. (b) The instance masks are matched to trackers by advancing each tracker's prediction by one timestep and matching it with the detections.
  • Figure 4: Method 2 overview (EventSeqFormer). This method uses a tracking-by-query approach. (a) We input an entire sequence of frames concatenated with E2VID images and divide it into smaller, overlapping chunks. (b) Within each chunk, we input the frames into SeqFormer for inference, resulting in tracklets. (c) Using the 20-frame overlap between consecutive tracklets, we associate them by Hungarian matching based on the IoU of instances, ultimately producing a tracking result for the entire sequence.
  • Figure 5: Sample "snapshots" of our dataset and the corresponding predictions by the tested methods.
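
The chunking and association steps described in the Figure 4 caption, steps (a) and (c), can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the chunk length of 50 frames, and the `min_iou` threshold are hypothetical; only the 20-frame overlap comes from the caption. Instances are represented as sets of (frame, row, col) pixels restricted to the overlap. To stay dependency-free, the sketch replaces the Hungarian algorithm with an exhaustive search over assignments, which finds the same optimal assignment and is affordable for at most seven instances.

```python
from itertools import permutations


def split_into_chunks(num_frames, chunk_len, overlap):
    """Return (start, end) index pairs, end exclusive, that cover the sequence."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    chunks, start = [], 0
    while start < num_frames:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += chunk_len - overlap  # consecutive chunks share `overlap` frames
    return chunks


def iou(a, b):
    """IoU of two instances, each given as a set of (frame, row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_tracklets(prev, curr, min_iou=0.5):
    """Map current-chunk IDs to previous-chunk IDs, maximising the total IoU.

    `prev` and `curr` map instance IDs to pixel sets within the overlap.
    Pairs scoring below `min_iou` (a hypothetical threshold) stay unmatched.
    """
    prev_ids, curr_ids = list(prev), list(curr)
    scores = {(c, p): iou(curr[c], prev[p]) for c in curr_ids for p in prev_ids}
    size = max(len(prev_ids), len(curr_ids))
    # Pad the shorter side with None so that surplus instances can stay unmatched.
    padded_prev = prev_ids + [None] * (size - len(prev_ids))
    padded_curr = curr_ids + [None] * (size - len(curr_ids))
    best_total, best_pairs = -1.0, []
    # Exhaustive search stands in for the Hungarian algorithm (assumption).
    for perm in permutations(padded_prev):
        pairs = [(c, p) for c, p in zip(padded_curr, perm)
                 if c is not None and p is not None]
        total = sum(scores[pair] for pair in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return {c: p for c, p in best_pairs if scores[(c, p)] >= min_iou}


# Toy example: a 100-frame sequence and two instances seen in one overlap frame.
chunks = split_into_chunks(num_frames=100, chunk_len=50, overlap=20)
prev = {1: {(0, 0, 0), (0, 0, 1)}, 2: {(0, 5, 5), (0, 5, 6)}}
curr = {"a": {(0, 5, 5), (0, 5, 6)}, "b": {(0, 0, 0), (0, 0, 1), (0, 0, 2)}}
links = match_tracklets(prev, curr)
print(chunks)  # [(0, 50), (30, 80), (60, 100)]
print(links)   # {'a': 2, 'b': 1}
```

In this toy example, instance "a" of the current chunk inherits ID 2 and instance "b" inherits ID 1, so identities carry over from one chunk to the next. For larger instance counts, a polynomial-time solver such as SciPy's `linear_sum_assignment` would be the usual choice.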