The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris

TL;DR

SA-FARI addresses a critical gap in wildlife multi-animal tracking (MAT) by introducing a large-scale, diverse, and densely annotated open-source dataset collected from 741 camera-trap sites across 4 continents, covering 99 species categories. It provides exhaustively labeled masklets and anonymized locations, enabling benchmarking with open-vocabulary prompts via SAM 3 and other baselines. The paper reports substantial gains when SA-FARI data are used to train or fine-tune state-of-the-art models (up to $32.9$ in $cgF1$, $19.6$ in $pHOTA$, and $19.1$ in $TETA$), underscoring the value of detailed domain-specific data for open-world MAT. It also introduces an evaluation framework that includes category augmentation and diverse test splits, and outlines future directions toward multi-modal extensions and broader ecological coverage to accelerate conservation-oriented video understanding.

Abstract

Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
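
To give a feel for the scale quoted in the abstract, the short Python sketch below derives a few dataset-wide means from the headline counts. These derived figures are not reported in the abstract. They assume that the 942,702 figure counts per-frame annotations, and the "~46 hours" total is itself approximate, so the results are rough orientation only. The function and variable names are illustrative.

```python
# Headline counts quoted in the abstract.
NUM_VIDEOS = 11_609
NUM_LOCATIONS = 741
NUM_MASKLETS = 16_224
NUM_FRAME_ANNOTATIONS = 942_702  # assumed to be per-frame box/mask/label annotations
APPROX_HOURS = 46                # "~46 hours" in the abstract, so only approximate


def dataset_averages():
    """Return simple dataset-wide means derived from the headline counts."""
    return {
        "videos_per_location": NUM_VIDEOS / NUM_LOCATIONS,
        "masklets_per_video": NUM_MASKLETS / NUM_VIDEOS,
        "annotations_per_masklet": NUM_FRAME_ANNOTATIONS / NUM_MASKLETS,
        "seconds_per_video": APPROX_HOURS * 3600 / NUM_VIDEOS,
    }


averages = dataset_averages()
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

These are plain means over the whole dataset. The long-tailed species distribution described in the figure captions suggests that individual videos and locations can deviate strongly from them.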

Paper Structure

This paper contains 30 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: SA-FARI Dataset Overview and Annotation. We 1) collect camera trap videos from 741 independent sampling locations across 4 continents, 2) label them with 99 species categories, and 3) exhaustively and manually annotate spatio-temporal masklets for each individual animal. Each video includes frame-level annotations, resulting in 16,224 unique identity masklets across $\sim$46 hours of video, which makes SA-FARI by far the largest dataset of its kind. Its rich annotations enable robust benchmarking of multi-animal tracking methods and support the development of generalizable, spatially accurate video understanding for wildlife.
  • Figure 2: Sample Skims from the SA-FARI Dataset. Each video–species pair is annotated with an exhaustive spatio-temporal segmentation of all animals belonging to that species category. The dataset captures a wide range of challenging scenarios, including: multiple animals in the same scene (a–d), occlusions between animals or with other scene elements (e), animals reappearing after leaving the frame (f), unconventional or partial views (g), small animals (c), nighttime conditions (b, e, g), and camouflaged animals (h).
  • Figure 3: Taxonomic Abundance of Videos. Each circle represents the relative abundance (i.e. number of videos) of taxonomic groups within the SA-FARI dataset, from Class to Order to Family. Circle size is proportional to the number of associated videos, while edge colors indicate the continent(s) where the taxa were recorded. This visualisation highlights both the taxonomic and geographic diversity captured in the dataset.
  • Figure 4: Masklet Statistics. Distribution of key masklet-level metrics across the SA-FARI dataset. Average masklet size and average IoU are computed by first averaging within each masklet and then across all masklets in a video. The total number of masklets and occlusion events are summed per video. Dotted lines indicate the medians of each distribution.
  • Figure 5: Distribution of Species Category in the SA-FARI Dataset. The two panels show the number of videos per species category, broken down by data split. The distribution follows a long-tailed pattern typical of real-world wildlife datasets, with a few dominant species and many rarely observed ones. Notably, several species, such as the Saki monkey, appear only in the test set, reflecting the natural open-world setting of camera trap deployments.
  • ...and 2 more figures
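
The two-level averaging described in the Figure 4 caption (average within each masklet, then across the masklets of a video, with medians taken over videos) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The data layout (a dict mapping video IDs to lists of per-frame values per masklet), the toy numbers, and the helper name `video_average` are all hypothetical.

```python
from statistics import mean, median


def video_average(masklets):
    """Average a per-frame metric within each masklet, then across masklets.

    `masklets` is a list of masklets; each masklet is a non-empty list of
    per-frame values (e.g. mask area as a fraction of the frame).
    """
    per_masklet = [mean(frames) for frames in masklets]
    return mean(per_masklet)


# Hypothetical toy data: video ID -> masklets -> per-frame relative mask size.
videos = {
    "video_a": [[0.10, 0.20], [0.40]],
    "video_b": [[0.05, 0.05, 0.05]],
    "video_c": [[0.30], [0.10, 0.30], [0.50]],
}

# Per-video average masklet size (two-level mean) and masklet count (a sum).
per_video_size = {vid: video_average(m) for vid, m in videos.items()}
per_video_count = {vid: len(m) for vid, m in videos.items()}

# Medians over videos, corresponding to the dotted lines in the histograms.
median_size = median(per_video_size.values())
median_count = median(per_video_count.values())

print(per_video_size)
print(median_size, median_count)
```

Averaging within each masklet first gives every animal equal weight regardless of how long it stays in view. Pooling all frames of `video_a` would instead give (0.10 + 0.20 + 0.40) / 3 ≈ 0.233, whereas the two-level mean is 0.275.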