Annolid: Annotate, Segment, and Track Anything You Need

Chen Yang; Thomas A. Cleland

TL;DR

Annolid harnesses the Cutie video object segmentation model to achieve resilient, markerless tracking of multiple animals from single annotated frames, even when the animals are partially or entirely concealed by environmental features or by one another.

Abstract

Annolid is a deep learning-based software package designed for the segmentation, labeling, and tracking of research targets within video files, focusing primarily on animal behavior analysis. Based on state-of-the-art instance segmentation methods, Annolid now harnesses the Cutie video object segmentation model to achieve resilient, markerless tracking of multiple animals from single annotated frames, even in environments in which they may be partially or entirely concealed by environmental features or by one another. Our integration of Segment Anything and Grounding-DINO strategies additionally enables the automatic masking and segmentation of recognizable animals and objects by text command, removing the need for manual annotation. Annolid's comprehensive approach to object segmentation flexibly accommodates a broad spectrum of behavior analysis applications, enabling the classification of diverse behavioral states such as freezing, digging, pup huddling, and social interactions in addition to the tracking of animals and their body parts.

Paper Structure (28 sections, 13 figures, 3 tables)

This paper contains 28 sections, 13 figures, 3 tables.

Introduction
Methods
Computational Environment
Data Sources and Validation
Operational Principles of Grounding DINO and SAM
Automatic Object Detection and Segmentation in Annolid
Operational Principles of Cutie Video Object Segmentation
Mask to Polygon Conversion
Integrating Cutie into Annolid for Enhanced Tracking
The Annolid Annotation Framework
Segmentation and Prediction
Manual Intervention and Correction
Finalizing Predictions to Track Anything
Results
Evaluation with the MATB Dataset
...and 13 more sections

Figures (13)

Figure 1: Examples of multiple markerless animal tracking results in Annolid [yang2023automated]. Annolid now utilizes the Grounding-DINO [liu2023grounding] and Segment Anything [kirillov2023segany, sam_hq, mobile_sam] models to automatically segment and label all instances of a named class in an initial frame, and then leverages the Cutie [cheng2023putting] open-world video object segmentation (VOS) model to track multiple animals throughout video recordings based on that single labeled frame (zero-shot learning). Top: Based on the end user entering the text "ant" in the search field at the upper right, Annolid automatically segments all instances matching that label (i.e., ants) in the initial frame (left panel), and then tracks the labeled animals across frames throughout the video (middle and right panels). Middle: As in the top panel, except that seven zebrafish are tracked based on a single frame of autolabeled instances (text prompt "fish"). Bottom: As in the top panel, except that four mice are tracked based on a single frame of autolabeled instances (text prompt "mouse"). Annolid successfully tracked the mice and ants throughout each ten-minute video using only the polygons automatically generated in the first frame; the zebrafish were also tracked successfully once human-in-the-loop corrections were incorporated. Images are derived from videos in the idTracker.ai dataset [romero2019idtracker].
Figure 2: Illustration of the Annolid GUI, and elements of its labeling, prediction, and validation workflow. The top row features a set of GUI tools including an open video button and a spin box for setting the mem_every parameter before initiating the prediction process with the Pred button. The text prompt box accepts words or phrases that define the automatic generation of polygons in the currently selected frame. Predicted polygons can be corrected manually, and labeling is saved in the LabelMe JSON file format.
Figure 3: Overview of the Cutie architecture [cheng2023putting] as integrated into Annolid. Labeled polygons are converted into masks from the currently selected frame and then stored in the FIFO memory buffer: specifically, pixel memory $F$ and object memory $S$, representing past segmented frames. Pixel memory is retrieved for the query frame as pixel readout $R_0$, which bidirectionally interacts with object queries $X$ and object memory $S$ in the object transformer. The object transformer comprises $L$ blocks that enrich the pixel features with object-level semantics and generate the final $R_L$ object readout for decoding into the output mask. Subsequently, the output mask is converted back to polygons for easy editing and visualization in the Annolid GUI.
Figure 5: Annolid performance on an idTracker.ai video [romero2019idtracker] featuring markerless tracking of six Drosophila fruit flies in an arena. From left to right: frames #1, #2000, and #4000 are shown. The complete video is available at https://youtu.be/uTs6CKgmdSw.
Figure: Mask for Epsilon Value 1.0
...and 8 more figures
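
The captions for Figures 2 and 3 mention a mem_every parameter and a FIFO memory buffer holding past segmented frames. The following is a minimal sketch of how such a buffer might behave. It assumes that mem_every is the interval, in frames, between memory insertions; the captions do not spell this out. The class FifoMemory, the function run_demo, and the capacity parameter are hypothetical names introduced only for illustration. They are not part of Annolid or Cutie, which store feature tensors rather than the placeholder strings used here.

```python
from collections import deque


class FifoMemory:
    """Toy first-in, first-out memory of (frame_index, mask) entries.

    Hypothetical illustration only: `capacity` and the exact meaning of
    `mem_every` are assumptions, not Annolid's or Cutie's actual API.
    """

    def __init__(self, capacity, mem_every):
        if capacity < 1 or mem_every < 1:
            raise ValueError("capacity and mem_every must be positive")
        self.mem_every = mem_every
        # A deque with maxlen silently discards the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def maybe_store(self, frame_index, mask):
        """Store the mask if frame_index falls on the mem_every interval."""
        if frame_index % self.mem_every == 0:
            self.buffer.append((frame_index, mask))
            return True
        return False

    def stored_frames(self):
        """Return the frame indices currently held, oldest first."""
        return [index for index, _ in self.buffer]


def run_demo(num_frames=12, capacity=3, mem_every=2):
    """Feed placeholder masks through the buffer and report what remains."""
    memory = FifoMemory(capacity, mem_every)
    for frame_index in range(num_frames):
        mask = f"mask_{frame_index}"  # stand-in for a real segmentation mask
        memory.maybe_store(frame_index, mask)
    return memory.stored_frames()


if __name__ == "__main__":
    print(run_demo())  # [6, 8, 10]
```

In this sketch, a larger mem_every stores fewer frames, so a buffer of fixed capacity spans a longer stretch of video at the cost of coarser temporal coverage.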
