The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris
TL;DR
SA-FARI addresses a critical gap in wildlife multi-animal tracking (MAT) by introducing a large-scale, diverse, and densely annotated open-source dataset collected from 741 camera-trap sites across 4 continents, covering 99 species. It provides exhaustively labeled masklets and anonymized locations, enabling robust benchmarking with open-vocabulary prompts using SAM 3 and other baselines. The paper demonstrates substantial gains when SA-FARI data are used for training or fine-tuning state-of-the-art models (e.g., cgF1, pHOTA, and TETA gains of up to 32.9, 19.6, and 19.1, respectively), underscoring the value of detailed domain-specific data for open-world MAT. It also introduces a principled evaluation framework that includes category augmentation and diverse test splits, and outlines future directions toward multi-modal extensions and broader ecological coverage to accelerate conservation-oriented video understanding.
Abstract
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
