GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring

Maximilian Schall, Felix Leonard Knöfel, Noah Elias König, Jan Jonas Kubeler, Maximilian von Klinski, Joan Wilhelm Linnemann, Xiaoshi Liu, Iven Jelle Schlegelmilch, Ole Woyciniuk, Alexandra Schild, Dante Wasmuht, Magdalena Bermejo Espinet, German Illera Basas, Gerard de Melo

TL;DR

Automating gorilla re-identification and population monitoring from camera-trap footage is hampered by manual labeling and a lack of in-the-wild datasets. The authors present GorillaWatch, an end-to-end pipeline augmented with multi-frame self-supervised pretraining and differentiable explainability, evaluated on three new benchmarks: Gorilla-SPAC-Wild, Gorilla-Berlin-Zoo, and Gorilla-SPAC-MoT. Key contributions include a comprehensive open-world dataset suite, temporal pretraining that improves re-ID by up to 11%, a differentiable AttnLRP-based faithfulness check, PTAM for tracklet aggregation, and constrained clustering to improve population counting. Results show that aggregating features from large-scale image backbones with temporal pretraining outperforms specialized video architectures in data-scarce wildlife settings, enabling scalable, non-invasive monitoring and providing a practical foundation for conservation analytics; all code and datasets are released for community use.

Abstract

Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species.
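The abstract does not say what form the spatiotemporal constraints take or how tracklet features are aggregated, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes (a) a tracklet embedding is the L2-normalised mean of its per-frame embeddings, (b) the only constraint is a cannot-link between tracklets that overlap in time within the same video, since one animal cannot appear twice in a frame, and (c) clustering is greedy average-linkage on cosine distance. All function names, field names, and the threshold are hypothetical.

```python
import numpy as np

def aggregate_tracklet(frame_embeddings):
    """Assumed aggregation: mean-pool per-frame embeddings, then L2-normalise."""
    mean = np.asarray(frame_embeddings, dtype=float).mean(axis=0)
    return mean / np.linalg.norm(mean)

def cannot_link_pairs(tracklets):
    """Index pairs of tracklets in the same video whose frame ranges overlap."""
    pairs = set()
    for i, a in enumerate(tracklets):
        for j in range(i + 1, len(tracklets)):
            b = tracklets[j]
            same_video = a["video"] == b["video"]
            overlap = a["start"] <= b["end"] and b["start"] <= a["end"]
            if same_video and overlap:
                pairs.add((i, j))
    return pairs

def constrained_clustering(embeddings, cannot_link, threshold):
    """Greedy average-linkage clustering on cosine distance.

    A merge is skipped if it would put a cannot-link pair in one cluster.
    Returns one integer cluster label per tracklet.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    dist = 1.0 - embeddings @ embeddings.T  # rows are unit vectors
    clusters = [[i] for i in range(len(embeddings))]

    def violates(ca, cb):
        return any((min(i, j), max(i, j)) in cannot_link for i in ca for j in cb)

    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if violates(clusters[a], clusters[b]):
                    continue
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = np.empty(len(embeddings), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Toy data: tracklets 0 and 1 co-occur in one clip and look alike;
# tracklet 2 is a later sighting resembling tracklet 0; tracklet 3 is distinct.
tracklets = [
    {"video": "cam1_clip1", "start": 0, "end": 50, "frames": [[1.0, 0.0], [1.0, 0.0]]},
    {"video": "cam1_clip1", "start": 10, "end": 60, "frames": [[0.99, 0.14], [0.99, 0.14]]},
    {"video": "cam2_clip7", "start": 0, "end": 40, "frames": [[0.995, -0.1], [0.995, -0.1]]},
    {"video": "cam2_clip7", "start": 100, "end": 150, "frames": [[0.0, 1.0], [0.0, 1.0]]},
]
embeddings = np.stack([aggregate_tracklet(t["frames"]) for t in tracklets])
constraints = cannot_link_pairs(tracklets)

constrained_labels = constrained_clustering(embeddings, constraints, threshold=0.1)
unconstrained_labels = constrained_clustering(embeddings, set(), threshold=0.1)
print(len(set(constrained_labels.tolist())), len(set(unconstrained_labels.tolist())))
```

In this toy case the cannot-link pair keeps the two co-occurring animals apart even though their embeddings are close. One plausible reading of the abstract's claim is that such constraints let the merge threshold be set more permissively, which reduces over-segmentation without merging individuals seen together. The source does not confirm that this is the mechanism used.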

Paper Structure

This paper contains 24 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of our complete re-identification pipeline.
  • Figure 2: Sample images from the Gorilla-SPAC-Wild dataset (top row) and the Gorilla-Berlin-Zoo dataset (bottom row).
  • Figure 3: Zero-shot performance comparison of different pretrained models on the Gorilla-SPAC-Wild test set.
  • Figure 4: Top-1 accuracy of DINOv2Giant after supervised fine-tuning on the Gorilla-SPAC-Wild training set.
  • Figure 5: Effect of Multi-frame Pretraining: We evaluate performance across single-frame, 4-frame, and 10-frame temporal sequences. The green bars denote accuracy immediately after pretraining, while the brown bars indicate the accuracy change ($\Delta$) after fine-tuning.
  • ...and 2 more figures