Zero-Shot Multi-Animal Tracking in the Wild

Jan Frederik Meier, Timo Lüddecke

TL;DR

This work tackles zero-shot multi-animal tracking in the wild by adapting SAM2MOT to operate without retraining or hyperparameter tuning. It introduces three robust components—adaptive detection thresholds, mask-based track initialization, and density-aware reconstruction—built atop Grounding DINO and SAM 2 to generalize across diverse datasets. Across four benchmarks, the method achieves strong HOTA and association metrics, demonstrating reliable cross-domain performance and enabling scalable wildlife monitoring and behavioral analysis. The approach emphasizes practical applicability, highlighting accuracy and robustness while acknowledging runtime and scalability considerations in crowded scenes.

Abstract

Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding DINO object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.

Paper Structure

This paper contains 32 sections, 5 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of our model architecture. Blue components originate from SAM 2, orange components from SAM2MOT. We modify the dark orange modules and introduce the red ones to adapt the model for robust multi-animal tracking.
  • Figure 2: Detection score distributions and adaptive thresholding. (a) In-domain detector scores, showing varied effective thresholds. (b) Zero-shot detector scores, with higher sensitivity to threshold selection. (c) Detection assignment using our K-Means–based adaptive thresholding, which automatically separates true positives from false positives without manual tuning.
  • Figure 3: Detection score distributions for different sequences of the ChimpAct test split. The distributions differ significantly between sequences.
  • Figure 4: Runtime and VRAM requirements for different numbers of tracks. The inference speed and memory consumption of SAM 2 increase with the number of tracked objects, indicating limited scalability in crowded scenes.
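
The adaptive thresholding in Figure 2(c) can be illustrated with a small sketch. The captions only say that K-Means separates true positives from false positives, so the details below are assumptions rather than the authors' implementation: two clusters, one-dimensional scores, and a threshold placed at the midpoint between the two cluster centers. The function name `kmeans_1d_threshold` is hypothetical.

```python
def kmeans_1d_threshold(scores, iters=50):
    """Two-cluster K-Means on 1-D detection scores.

    Returns a threshold halfway between the two cluster centers.
    Assumptions (not from the paper): k=2 and the midpoint rule.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    lo, hi = min(scores), max(scores)  # initial cluster centers
    if lo == hi:
        return lo  # all scores identical: nothing to separate
    for _ in range(iters):
        # In 1-D with two centers, nearest-center assignment is a
        # split at the midpoint between the centers.
        mid = (lo + hi) / 2.0
        low = [s for s in scores if s <= mid]
        high = [s for s in scores if s > mid]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break  # centers stopped moving: converged
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


# Made-up scores for one sequence: a low-score cluster and a high-score cluster.
scores = [0.05, 0.10, 0.12, 0.80, 0.85, 0.90]
threshold = kmeans_1d_threshold(scores)
kept = [s for s in scores if s > threshold]
```

Because the threshold is recomputed from each sequence's own scores, it moves with the score distribution instead of being fixed in advance, which is the property the captions of Figures 2 and 3 motivate.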