MARINE: A Computer Vision Model for Detecting Rare Predator-Prey Interactions in Animal Videos

Zsófia Katona, Seyed Sahand Mohammadi Ziabari, Fatemeh Karimi Nejadasl

TL;DR

MARINE addresses the challenge of detecting rare predator–prey interactions in animal videos by combining motion-based frame selection with pre-trained DINOv2 embeddings and a lightweight classifier head. Evaluated on a coral reef fish dataset and a fish subset of the Animal Kingdom (AK) dataset, MARINE G14 consistently outperforms VideoMAE in action recognition (AR) and achieves strong, though variable, temporal localization at a lowered $t$-IoU threshold of $0.25$. The results position MARINE as a viable starter framework for animal AR and temporal action detection (AD), while revealing limitations in multi-label generalization and suggesting improvements such as advanced frame selection, actor-focused embeddings, and multi-modal cues. Overall, the work advances wildlife video understanding by providing a modular, efficient approach with clear directions for extension to broader ecological data.

Abstract

Encounters between predator and prey play an essential role in ecosystems, but their rarity makes them difficult to detect in video recordings. Although advances in action recognition (AR) and temporal action detection (AD), especially transformer-based models and vision foundation models, have achieved high performance on human action datasets, animal videos remain relatively under-researched. This thesis addresses this gap by proposing the model MARINE, which combines motion-based frame selection designed for fast animal actions with DINOv2 feature extraction and a trainable classification head for action recognition. MARINE outperforms VideoMAE in identifying predator attacks in videos of fish, both on a small, specific coral reef dataset (81.53% against 52.64% accuracy) and on a subset of the more extensive Animal Kingdom dataset (94.86% against 83.14% accuracy). In a multi-label setting on a representative sample of Animal Kingdom, MARINE achieves 23.79% mAP, positioning it mid-field among existing benchmarks. Furthermore, in an AD task on the coral reef dataset, MARINE achieves 80.78% AP (against VideoMAE's 34.89%), although at a lowered $t$-IoU threshold of 25%. Therefore, despite room for improvement, MARINE offers an effective starter framework for AR and AD tasks on animal recordings and can thus contribute to the study of natural ecosystems.

Paper Structure (45 sections, 2 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Processing the original coral reef dataset. Each of the 44 10-second-long videos is split into 5 clips, with the middle one containing a predator attack. The resulting 2-second-long clips (220 in total) make up the processed coral reef dataset, in which the clip taken from the centre of each uncut video receives the positive label and all others the negative label.
  • Figure 2: Action detection pipeline. The untrimmed video is separated into shorter clips, which are passed to the action recognition model. Combining the predictions of this model and the timestamps which locate the clips in the original video, the temporal location of the predator attack can be predicted in the original footage.
  • Figure 3: ROC curves of MARINE G14 on the coral reef and AK fish datasets, without registers in the DINOv2 backbone. The higher AUC value for the AK fish dataset is consistent across all configurations of the MARINE model.
  • Figure 4: Frequency of predation-related action labels in the AK fish dataset. A single video may include several of these actions.
  • Figure 5: ROC curves of MARINE G14 with and without the use of registers, on both the coral reef (orange curve) and AK fish (blue curve) datasets.
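
The clip construction in Figure 1 and the clip-to-timeline step in Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_into_clips`, `merge_positive_clips`, `t_iou`) are hypothetical, the clips are assumed to be non-overlapping, and merging adjacent positive clips into one interval is only one plausible way to combine clip predictions with timestamps. The per-clip predictions in the usage lines are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def split_into_clips(video_len: float = 10.0,
                     n_clips: int = 5) -> List[Tuple[Interval, int]]:
    """Cut one video into equal, non-overlapping clips (assumption).

    Only the middle clip receives the positive label, as in Figure 1.
    """
    clip_len = video_len / n_clips
    middle = n_clips // 2
    return [((i * clip_len, (i + 1) * clip_len), int(i == middle))
            for i in range(n_clips)]


def merge_positive_clips(windows: List[Interval],
                         preds: List[int]) -> List[Interval]:
    """Merge runs of adjacent positive clips into predicted intervals.

    Hypothetical post-processing; the source only states that predictions
    and clip timestamps are combined.
    """
    segments: List[Interval] = []
    for (start, end), pred in zip(windows, preds):
        if not pred:
            continue
        if segments and abs(segments[-1][1] - start) < 1e-9:
            # Clip starts where the previous positive clip ended: extend it.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


def t_iou(a: Interval, b: Interval) -> float:
    """Temporal intersection over union of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# One 10-second video yields 5 clips; 44 videos yield 220 clips.
clips = split_into_clips()
windows = [w for w, _ in clips]
labels = [label for _, label in clips]   # [0, 0, 1, 0, 0]

# Made-up clip predictions: the model fires on clips 2 and 3.
predicted = merge_positive_clips(windows, [0, 1, 1, 0, 0])  # [(2.0, 6.0)]
ground_truth = windows[labels.index(1)]                     # (4.0, 6.0)
hit = t_iou(predicted[0], ground_truth) >= 0.25             # t-IoU is 0.5
```

In this toy case the predicted interval is twice as long as the ground-truth clip, so it counts as a detection at the lowered threshold of 0.25 and sits exactly at the boundary of a 0.5 threshold, which shows how much the threshold choice affects the reported AP.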