Recurrence over Video Frames (RoVF) for the Re-identification of Meerkats
Mitchell Rogers, Kobe Knowles, Gaël Gendron, Shahrokh Heidari, David Arturo Soriano Valdez, Mihailo Azhar, Padriac O'Leary, Simon Eyre, Michael Witbrock, Patrice Delmas
TL;DR
This work targets open-set animal re-identification from video when ground-truth IDs are unavailable. It introduces Recurrence over Video Frames (RoVF), a model that adds a Perceiver-based recurrent head to a pre-trained image transformer (DINOv2) to iteratively build a video embedding. Training uses a hard triplet mining strategy tailored to unlabeled IDs, and evaluation is on a new meerkat dataset collected in a zoo, where RoVF reaches a top-1 accuracy of $0.49$ compared with $0.42$ for the best DINOv2 baseline. The study provides a dataset generation protocol, a novel training scheme, and a proof of concept that video-based transformers can capture ID-specific cues beyond static appearance, with implications for conservation and behavioral monitoring. Future directions include pretext tasks, behavior classification, and rigorous hyperparameter optimization to further enhance performance and robustness.
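As a rough illustration of what "iteratively build a video embedding" can mean, the sketch below folds a clip into a single vector one frame at a time, using a cross-attention read followed by a residual update. It is a toy under stated assumptions, not the RoVF head itself: there are no learned projections, the latent is a single vector, and the names `attend`, `embed_clip`, and `init_latent` are hypothetical. In the real model the per-frame tokens would come from DINOv2; here they are hand-written lists.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(latent, tokens):
    """Single-head cross-attention of one latent vector over a frame's tokens.

    Assumption: no learned query/key/value projections, purely illustrative.
    """
    scale = math.sqrt(len(latent))
    weights = softmax([dot(latent, t) / scale for t in tokens])
    return [sum(w * t[i] for w, t in zip(weights, tokens))
            for i in range(len(latent))]

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]

def embed_clip(frames, init_latent):
    """Fold a clip into one embedding by updating the latent once per frame."""
    latent = list(init_latent)
    for tokens in frames:  # tokens: feature vectors for one frame
        context = attend(latent, tokens)
        # Residual update, then normalise so the latent stays on the unit sphere.
        latent = l2_normalize([l + c for l, c in zip(latent, context)])
    return latent

# Hypothetical two-frame clip with two 3-dimensional tokens per frame.
clip = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
embedding = embed_clip(clip, init_latent=[0.1, 0.1, 0.1])
```

The point of the recurrence is that the latent carried from earlier frames determines what is read from the next one, so the final embedding depends on the clip as a sequence and is not a simple average of frame features.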
Abstract
Deep learning approaches for animal re-identification have had a major impact on conservation, significantly reducing the time required for many downstream tasks, such as well-being monitoring. We propose a method called Recurrence over Video Frames (RoVF), which uses a recurrent head based on the Perceiver architecture to iteratively construct an embedding from a video clip. RoVF is trained using a triplet loss based on the co-occurrence of individuals in the video frames, so it can be trained when individual IDs are unavailable. We tested this method and various models based on the DINOv2 transformer architecture on a dataset of meerkats collected at the Wellington Zoo. Our method achieves a top-1 re-identification accuracy of $49\%$, which is higher than that of the best DINOv2 model ($42\%$). We found that the model can match observations of individuals where humans cannot, and that RoVF outperforms the comparison models with minimal fine-tuning. In future work, we plan to improve these models by using pretext tasks, apply them to animal behaviour classification, and perform a hyperparameter search to optimise the models further.
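The co-occurrence idea can be made concrete with a small sketch. One reading, assumed here and not confirmed by the abstract, is that two animals visible in the same frames must be different individuals, so their clips can serve as negatives without any ID labels, while another clip from the anchor's own track serves as the positive. The function names, the margin of 1.0, and the toy embeddings are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the positive in, push the negative out."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

def hardest_negative(anchor, candidates):
    """Pick the candidate closest to the anchor, i.e. the hardest negative."""
    return min(candidates, key=lambda c: distance(anchor, c))

# Hypothetical clip embeddings (2-D for readability).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]                      # another clip from the same track (assumed positive)
co_occurring = [[0.5, 0.0], [3.0, 4.0]]    # clips of animals seen in the same frames as the anchor

negative = hardest_negative(anchor, co_occurring)
loss = triplet_loss(anchor, positive, negative, margin=1.0)
```

Here the nearby co-occurring clip is selected as the negative and yields a non-zero loss, whereas the distant one would already satisfy the margin and contribute nothing. That is the usual motivation for mining hard triplets.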
