Zero-shot Shark Tracking and Biometrics from Aerial Imagery

Chinmay K Lalgudi, Mark E Leone, Jaden V Clark, Sergio Madrigal-Mora, Mario Espinoza

TL;DR

This work addresses non-invasive, scalable monitoring of sharks from aerial imagery by introducing FLAIR, a zero-shot framework that combines Segment Anything Model 2 (SAM2) and CLIP to produce accurate shark segmentation masks from drone videos without labeled data or model fine-tuning. FLAIR samples frames, generates masks with SAM2, filters with CLIP prompts, and propagates masks through video with SAM2 Video Prediction, aligning tracks to suppress false positives. The approach enables downstream biometrics like body length and tailbeat frequency with high fidelity, demonstrating competitive segmentation accuracy and dramatic efficiency gains compared to traditional baselines and human-in-the-loop methods. By generalizing to multiple shark species and providing a blueprint for automatic biometrics extraction, FLAIR promises to accelerate ecological insights while reducing the expertise and time required for aerial wildlife analysis.

Abstract

The recent widespread adoption of drones for studying marine animals provides opportunities for deriving biological information from aerial imagery. The large scale of imagery data acquired from drones is well suited for machine learning (ML) analysis. Development of ML models for analyzing marine animal aerial imagery has followed the classical paradigm of training, testing, and deploying a new model for each dataset, requiring significant time, human effort, and ML expertise. We introduce Frame Level ALIgnment and tRacking (FLAIR), which leverages the video understanding of Segment Anything Model 2 (SAM2) and the vision-language capabilities of Contrastive Language-Image Pre-training (CLIP). FLAIR takes a drone video as input and outputs segmentation masks of the species of interest across the video. Notably, FLAIR takes a zero-shot approach, eliminating the need for labeled data, training a new model, or fine-tuning an existing model to generalize to other species. With a dataset of 18,000 drone images of Pacific nurse sharks, we trained state-of-the-art object detection models to compare against FLAIR. We show that FLAIR substantially outperforms these object detectors and performs competitively against two human-in-the-loop methods for prompting SAM2, achieving a Dice score of 0.81. FLAIR readily generalizes to other shark species without additional human effort and can be combined with novel heuristics to automatically extract relevant information, including length and tailbeat frequency. FLAIR has significant potential to accelerate aerial imagery analysis workflows, requiring markedly less human effort and expertise than traditional machine learning workflows while achieving superior accuracy. By reducing the effort required for aerial imagery analysis, FLAIR allows scientists to spend more time interpreting results and deriving insights about marine ecosystems.
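
For readers unfamiliar with the metric, the Dice score reported in the abstract measures overlap between a predicted mask and a reference mask: twice the number of pixels the two masks share, divided by the total number of pixels in both. The sketch below is a minimal illustration in plain Python, not the authors' evaluation code. The function name `dice_score`, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for the example.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as equally sized 2D lists of 0/1."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    # Pixels set to 1 in both masks.
    intersection = sum(
        p & t for prow, trow in zip(pred, truth) for p, t in zip(prow, trow)
    )
    # Total foreground pixels across both masks.
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 3x4 masks (hypothetical data): 4 shared pixels, 5 foreground pixels each.
predicted = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 1, 0]]
annotated = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 1, 0, 0]]

print(round(dice_score(predicted, annotated), 2))  # 2 * 4 / (5 + 5) = 0.8
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no pixels, so the reported 0.81 indicates substantial but imperfect overlap with the reference annotations.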

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, and 1 table.

Figures (8)

  • Figure 1: A map of the sites where drone video data for the Pacific nurse shark dataset was collected in Santa Elena Bay, Guanacaste, Costa Rica (a). Inset map in (b) shows the Santa Elena Bay study area (c), in which the yellow and orange dots represent the pre-planned flight path locations at Sortija Beach and Matapalito Beach, respectively.
  • Figure 2: Automated biometrics workflow with FLAIR. Inputs are italicized and in green, outputs are bolded and in blue. (a) Representative frames from the Pacific nurse shark UAV video dataset. Tests of generalization of FLAIR beyond our study site and selected species were conducted with crowdsourced videos of a white shark (b) and a blacktip reef shark (c).
  • Figure 3: Summary of segmentation methods compared in our experiments for Per-frame Prompting with SAM 2 Mask Generation (a), Object Detection methods paired with SAM 2 Mask Generation (b), Human-in-the-Loop Tracking with SAM 2 Video Prediction (c), and FLAIR (d). Steps requiring human input/effort are shown in red. Effort is shown in the upper right of each figure.
  • Figure 4: Detailed overview of FLAIR architecture. Input video is passed into SAM 2 Mask Generation to segment all objects, which are then filtered by CLIP score given language prompts. Candidate masks are propagated through the video and aligned to eliminate false positives, resulting in accurate tracking of objects of interest.
  • Figure 5: Dice score comparison of Per-frame Prompting + SAM 2, YOLOv8 + SAM 2, DETR + SAM 2, HiL-Tracking + SAM 2 Video, and FLAIR on 2 unseen holdout videos of nurse sharks. Object detector methods have near-zero segmentation accuracy on the second video.
  • ...and 3 more figures
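
The Figure 4 caption describes candidate masks being "filtered by CLIP score given language prompts." One way such a filter could work is sketched below in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the prompt strings, the softmax over per-prompt similarity scores, the 0.8 threshold, and the hard-coded scores (standing in for real CLIP outputs) are all hypothetical. No actual SAM 2 or CLIP calls are made.

```python
import math


def softmax(logits):
    """Convert raw similarity scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def filter_masks(mask_logits, prompts, target, threshold=0.8):
    """Keep the ids of masks whose probability for `target` meets the threshold.

    mask_logits maps a mask id to one similarity score per prompt,
    listed in the same order as `prompts`.
    """
    idx = prompts.index(target)
    kept = []
    for mask_id, logits in mask_logits.items():
        if softmax(logits)[idx] >= threshold:
            kept.append(mask_id)
    return kept


# Hypothetical prompts and image-text similarity scores for three candidate masks.
prompts = ["a shark", "a rock", "a wave"]
mask_logits = {
    "mask_0": [25.0, 20.0, 19.0],  # clearly closest to "a shark"
    "mask_1": [18.0, 24.0, 20.0],  # closest to "a rock"
    "mask_2": [21.0, 20.5, 20.0],  # ambiguous, falls below the threshold
}

print(filter_masks(mask_logits, prompts, target="a shark"))  # ['mask_0']
```

In this sketch, contrasting the target prompt with distractor prompts lets an ambiguous mask be rejected even when its raw similarity to the target is the highest of its three scores. According to the caption, the masks that survive filtering are then propagated through the video and aligned to eliminate false positives; that step is not sketched here.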