GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring
Maximilian Schall, Felix Leonard Knöfel, Noah Elias König, Jan Jonas Kubeler, Maximilian von Klinski, Joan Wilhelm Linnemann, Xiaoshi Liu, Iven Jelle Schlegelmilch, Ole Woyciniuk, Alexandra Schild, Dante Wasmuht, Magdalena Bermejo Espinet, German Illera Basas, Gerard de Melo
TL;DR
Automating gorilla re-identification and population monitoring from camera-trap footage is hampered by manual labeling and a lack of in-the-wild datasets. The authors present GorillaWatch, an end-to-end pipeline augmented with multi-frame self-supervised pretraining and differentiable explainability, evaluated on three new benchmarks: Gorilla-SPAC-Wild, Gorilla-Berlin-Zoo, and Gorilla-SPAC-MoT. Key contributions include a comprehensive open-world dataset suite, temporal pretraining that improves re-ID by up to 11%, a differentiable AttnLRP-based faithfulness check, PTAM for tracklet aggregation, and constrained clustering to improve population counting. Results show that aggregating features from large-scale image backbones with temporal pretraining outperforms specialized video architectures in data-scarce wildlife settings, enabling scalable, non-invasive monitoring and providing a practical foundation for conservation analytics; all code and datasets are released for community use.
Abstract
Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species.
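To make the idea of aggregating per-frame image features over a tracklet concrete, the following is a minimal sketch, not the paper's PTAM method. It assumes each frame of a tracklet has already been embedded by some image backbone, and it substitutes plain mean pooling with L2 normalization for the aggregation step. The function names (`l2_normalize`, `aggregate_tracklet`, `cosine_similarity`) and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def aggregate_tracklet(frame_embeddings):
    """Mean-pool per-frame embeddings into one tracklet descriptor.

    Assumption: mean pooling stands in for the paper's aggregation module,
    whose details are not given in the abstract.
    """
    if not frame_embeddings:
        raise ValueError("tracklet has no frames")
    dim = len(frame_embeddings[0])
    # Normalize each frame first so no single frame dominates the average.
    normalized = [l2_normalize(e) for e in frame_embeddings]
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy example: one tracklet with two frame embeddings (hypothetical values).
tracklet = [[1.0, 0.0], [0.8, 0.6]]
descriptor = aggregate_tracklet(tracklet)

# Compare the descriptor against two hypothetical gallery identities.
gallery = {"individual_a": [1.0, 0.0], "individual_b": [0.0, 1.0]}
best_match = max(gallery, key=lambda name: cosine_similarity(descriptor, gallery[name]))
```

In this sketch a tracklet is matched to the gallery identity with the highest cosine similarity; a real system would use high-dimensional backbone features and handle unknown individuals, which this example does not attempt.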
