Detection and Identification of Penguins Using Appearance and Motion Features

Kasumi Seko, Hiroki Kinoshita, Raj Rajeshwar Malinda, Hiroaki Kawashima

TL;DR

This study proposes a framework that enhances both detection and identification performance by integrating appearance and motion features in YOLO11, and introduces a tracklet-based contrastive learning approach applied after tracking.

Abstract

In animal facilities, continuous surveillance of penguins is essential yet technically challenging because of their homogeneous appearance, rapid and frequent posture changes, and substantial environmental noise such as water reflections. In this study, we propose a framework that enhances both detection and identification performance by integrating appearance and motion features. For detection, we adapted YOLO11 to process consecutive frames, addressing the lack of temporal consistency in single-frame detectors. This approach leverages motion cues to detect targets even when distinctive visual features are obscured. Our evaluation shows that fine-tuning the model with two-frame inputs improves mAP@0.5 over the baseline, from 0.922 to 0.933, and recovers individuals that are indistinguishable in static images. For identification, we introduce a tracklet-based contrastive learning approach applied after tracking. Through qualitative visualization, we demonstrate that the method produces coherent feature embeddings in which samples from the same individual lie closer together in the feature space, suggesting its potential for mitigating ID switching.
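The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the two frames are fed to YOLO11, so channel-wise stacking is an assumption here, and the loss is a generic supervised-contrastive-style loss that treats tracklet IDs as pseudo-labels, swapped in for whatever objective the paper actually uses. The function names `stack_frame_pairs` and `tracklet_contrastive_loss`, and the `temperature` value, are hypothetical.

```python
import numpy as np


def stack_frame_pairs(frames: np.ndarray) -> np.ndarray:
    """Pair each frame with its predecessor along the channel axis.

    frames: array of shape (T, H, W, C).
    Returns an array of shape (T - 1, H, W, 2 * C), where sample i holds
    frame i in the first C channels and frame i + 1 in the last C.
    Assumption: channel stacking is only one possible two-frame input.
    """
    if frames.ndim != 4 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames shaped (T, H, W, C)")
    return np.concatenate([frames[:-1], frames[1:]], axis=-1)


def tracklet_contrastive_loss(embeddings: np.ndarray,
                              tracklet_ids: np.ndarray,
                              temperature: float = 0.1) -> float:
    """Supervised-contrastive-style loss using tracklet IDs as labels.

    Crops from the same tracklet are positives; all others are negatives.
    Anchors without any positive are skipped.
    """
    ids = np.asarray(tracklet_ids)
    n = len(ids)
    self_mask = np.eye(n, dtype=bool)
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0  # no tracklet contributes more than one crop

    # Cosine similarity between L2-normalised embeddings.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-comparisons

    # Row-wise log-softmax over all other samples (numerically stable).
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True))

    # Average negative log-probability of the positives for each anchor.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[valid] / n_pos[valid]).mean())
```

With this sketch, embeddings that cluster by tracklet give a lower loss than the same embeddings under mismatched tracklet labels, which is the behaviour the abstract describes as bringing samples from the same individual closer in the feature space.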

Paper Structure (24 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Architecture of the proposed detection method
  • Figure 2: Detection results during swimming. Individuals difficult to distinguish in still images (baseline, left) are detected by utilizing video information (proposed method, right).
  • Figure 3: Detection results in background regions unseen during training. Moving individuals were detected by both methods (top), whereas stationary individuals were not detected (bottom), illustrating the contribution of motion cues.
  • Figure 4: Visualization of feature embeddings using t-SNE
  • Figure 5: Grad-CAM visualization examples for IDs 15 and 21