Recurrence over Video Frames (RoVF) for the Re-identification of Meerkats

Mitchell Rogers; Kobe Knowles; Gaël Gendron; Shahrokh Heidari; David Arturo Soriano Valdez; Mihailo Azhar; Padriac O'Leary; Simon Eyre; Michael Witbrock; Patrice Delmas

TL;DR

This work targets open-set animal re-identification from video without ground-truth IDs by introducing Recurrence over Video Frames (RoVF), a model that adds a Perceiver-based recurrent head to a pre-trained image transformer (DINOv2) to iteratively build a video embedding. It employs a hard triplet mining strategy tailored to unlabeled IDs and is evaluated on a new meerkat dataset collected at the Wellington Zoo, where it reaches a top-1 accuracy of $0.49$, compared with $0.42$ for the best DINOv2 baseline. The study provides a dataset generation protocol, a novel training scheme, and a proof of concept that video-based transformers can capture ID-specific cues beyond static appearance, with implications for conservation and behavioral monitoring. Future directions include pretext tasks, behavior classification, and a rigorous hyperparameter search to further improve performance and robustness.

Abstract

Deep learning approaches for animal re-identification have had a major impact on conservation, significantly reducing the time required for many downstream tasks, such as well-being monitoring. We propose a method called Recurrence over Video Frames (RoVF), which uses a recurrent head based on the Perceiver architecture to iteratively construct an embedding from a video clip. RoVF is trained using a triplet loss based on the co-occurrence of individuals in the video frames, for settings where individual IDs are unavailable. We tested this method and various models based on the DINOv2 transformer architecture on a dataset of meerkats collected at the Wellington Zoo. Our method achieves a top-1 re-identification accuracy of $49\%$, which is higher than that of the best DINOv2 model ($42\%$). We found that the model can match observations of individuals where humans cannot, and that our model (RoVF) outperforms the comparison models with minimal fine-tuning. In future work, we plan to improve these models by using pretext tasks, apply them to animal behaviour classification, and perform a hyperparameter search to optimise the models further.
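
The triplet training described above can be illustrated with a minimal NumPy sketch. It assumes that each anchor clip comes with a set of positive clips (the same animal) and a set of negative clips (other animals seen in the same frames), and that the "hard" triplet is the farthest positive paired with the closest negative. That selection rule, the Euclidean distance, the margin value, and the function names are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def hardest_triplet(anchor, positives, negatives):
    """Return indices of the farthest positive and the closest negative.

    anchor: (D,) embedding; positives, negatives: (N, D) embeddings.
    Assumption: 'hard' means max anchor-positive and min anchor-negative distance.
    """
    d_pos = np.linalg.norm(positives - anchor, axis=1)
    d_neg = np.linalg.norm(negatives - anchor, axis=1)
    return int(np.argmax(d_pos)), int(np.argmin(d_neg))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_ap - d_an + margin))

# Toy embeddings standing in for clip embeddings (hypothetical data).
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positives = anchor + 0.1 * rng.normal(size=(20, 8))  # 20 clips of the same animal
negatives = rng.normal(size=(20, 8)) + 3.0           # 20 clips of other animals
i, j = hardest_triplet(anchor, positives, negatives)
loss = triplet_loss(anchor, positives[i], negatives[j])
```

Because the positive and negative sets come from co-occurrence in the video rather than from identity labels, a sketch like this never needs to know which named individual a clip shows.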

Paper Structure (9 sections, 3 figures, 1 table, 1 algorithm)

This paper contains 9 sections, 3 figures, 1 table, 1 algorithm.

Introduction
Methodology
Dataset creation
Triplet loss
Model architecture
Evaluation metrics
Experiments
Results
Conclusion

Figures (3)

Figure 1: Example positive and negative sets. The first frame of 20 positive clips of the same meerkat (left half) and 20 negative clips of other meerkats (right half). The anchor (green), positive (orange), and negative (red) have been selected based on embeddings from the training ResNet model.
Figure 2: Recurrence over Video Frames (RoVF) adds a recurrent component on top of an existing image model, which outputs image/frame embeddings, so that a representation of a whole video can be constructed. The recurrent architecture iterates over the frames, updating its representation of the video from the image model's embeddings for each frame; after the last frame, the recurrent model outputs a video embedding.
Figure 3: Examples of incorrect (red) and correct (green) re-identifications of a query clip (left-most column) using the best RoVF model. The embedding distance between the query and gallery clip is shown underneath each thumbnail.
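
The recurrence in Figure 2 and the query-to-gallery matching in Figure 3 can be sketched together in NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: the frame tokens are random stand-ins for the image model's outputs, the recurrent head is reduced to a single cross-attention update of a latent array per frame (in the spirit of a Perceiver), the video embedding is the mean of the final latents, and top-1 accuracy is computed by nearest Euclidean neighbour against hypothetical ID labels for the evaluation clips. All names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_video_embedding(frame_tokens, latents, w_q, w_k, w_v):
    """Iterate over frames, updating a latent state by cross-attention.

    frame_tokens: (T, N, D) per-frame token embeddings from an image model.
    latents: (L, D) initial latent state; w_q, w_k: (D, K); w_v: (D, D).
    Returns a (D,) video embedding after the last frame.
    """
    z = latents
    scale = np.sqrt(w_q.shape[1])
    for tokens in frame_tokens:                               # one step per frame
        attn = softmax((z @ w_q) @ (tokens @ w_k).T / scale)  # (L, N) weights
        z = z + attn @ (tokens @ w_v)                         # residual update
    return z.mean(axis=0)

def top1_accuracy(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose nearest gallery embedding has the same ID."""
    diffs = query_emb[:, None, :] - gallery_emb[None, :, :]
    nearest = np.linalg.norm(diffs, axis=-1).argmin(axis=1)
    return float(np.mean(gallery_ids[nearest] == query_ids))

# Random stand-ins for model weights and frame tokens (hypothetical sizes).
rng = np.random.default_rng(0)
T, N, D, L, K = 10, 16, 32, 4, 8
frames = rng.normal(size=(T, N, D))
video_emb = recurrent_video_embedding(
    frames,
    latents=rng.normal(size=(L, D)),
    w_q=rng.normal(size=(D, K)) / np.sqrt(D),
    w_k=rng.normal(size=(D, K)) / np.sqrt(D),
    w_v=rng.normal(size=(D, D)) / np.sqrt(D),
)
```

Carrying one latent state across frames means the per-frame cost does not grow with clip length, and the final state can be compared directly against gallery embeddings as in `top1_accuracy`.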
