SeaTurtleID2022: A long-span dataset for reliable sea turtle re-identification

Lukáš Adam; Vojtěch Čermák; Kostas Papafitsoros; Lukáš Picek

TL;DR

SeaTurtleID2022 is a public dataset for sea turtle re-identification with $8729$ photographs of $438$ individuals collected over $13$ years, which makes it the longest-spanning public dataset of wild animal images to date. Every photograph carries a timestamp annotation. The paper shows that random splits overestimate re-identification performance, so time-aware splits are needed for a realistic evaluation, and it provides both closed-set and open-set benchmarks, including body-part subsets. In the baseline benchmarks, deep metric learning (ArcFace with Swin-B) brings strong improvements, and a practical end-to-end system reaches $86.8\%$ accuracy, significantly outperforming naive full-image approaches. The dataset also supports vision tasks beyond re-identification and underlines the value of time information for realistic ecological evaluation and long-term monitoring.

Abstract

This paper introduces the first public large-scale, long-span dataset with sea turtle photographs captured in the wild: SeaTurtleID2022 (https://www.kaggle.com/datasets/wildlifedatasets/seaturtleid2022). The dataset contains 8729 photographs of 438 unique individuals collected within 13 years, making it the longest-spanned dataset for animal re-identification. All photographs include various annotations, e.g., identity, encounter timestamp, and body parts segmentation masks. Instead of standard "random" splits, the dataset allows for two realistic and ecologically motivated splits: (i) a time-aware closed-set with training, validation, and test data from different days/years, and (ii) a time-aware open-set with new unknown individuals in test and validation sets. We show that time-aware splits are essential for benchmarking re-identification methods, as random splits lead to performance overestimation. Furthermore, a baseline instance segmentation and re-identification performance over various body parts is provided. Finally, an end-to-end system for sea turtle re-identification is proposed and evaluated. The proposed system based on Hybrid Task Cascade for head instance segmentation and ArcFace-trained feature-extractor achieved an accuracy of 86.8%.
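The difference between the two time-aware splits can be illustrated with a minimal sketch. It is not the authors' splitting procedure: the toy records, the function name `time_aware_split`, the single `cutoff_year` parameter, and the `"new"` label for unseen individuals are all assumptions made for illustration, and the validation set is omitted for brevity.

```python
from datetime import date

# Hypothetical toy records: (photo_id, identity, encounter_date).
records = [
    ("p1", "t001", date(2010, 7, 1)),
    ("p2", "t001", date(2015, 6, 20)),
    ("p3", "t002", date(2011, 8, 3)),
    ("p4", "t002", date(2019, 7, 9)),
    ("p5", "t003", date(2020, 6, 30)),
]


def time_aware_split(records, cutoff_year, open_set):
    """Split by encounter year so that train and test never share a period."""
    train = [r for r in records if r[2].year < cutoff_year]
    later = [r for r in records if r[2].year >= cutoff_year]
    known = {identity for _, identity, _ in train}
    if open_set:
        # Open-set: keep individuals unseen in training, relabelled as "new".
        test = [(pid, ident if ident in known else "new", d)
                for pid, ident, d in later]
    else:
        # Closed-set: drop individuals that never appear in training.
        test = [r for r in later if r[1] in known]
    return train, test


train, closed_test = time_aware_split(records, 2015, open_set=False)
_, open_test = time_aware_split(records, 2015, open_set=True)
```

A random split would instead shuffle all five photographs, so photographs of the same individual taken on the same day could land in both training and test data. That is the kind of leakage a time-aware split is meant to rule out.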

Paper Structure (21 sections, 1 equation, 12 figures, 10 tables)

Introduction
The SeaTurtleID2022 dataset
Data collection
Dataset highlights
Dataset splits and subsets
Sea turtle re-identification baselines
Local feature-based methods
Metric learning
Random vs. time-aware splits
Baseline Results
Random vs time-aware splits
Body-parts segmentation baselines
Recommended end-to-end system
Conclusions
Acknowledgments
...and 6 more sections

Figures (12)

Figure 1: The long-span difference in visual appearance of one individual sea turtle. The shapes of the facial scales remain the same, but other features, e.g., coloration, pigmentation, shape, and scratches, change over time.
Figure 2: Selected individual turtle (t023) from the SeaTurtleID2022 database, photographed with three different camera set-ups. Photographs taken with the DSLR camera are of higher quality, and the additional use of flash recovers the natural colouration of the animal. All the photographs were cropped for illustration purposes.
Figure 3: Number of photographs for each of the 438 turtles. The orange line corresponds to 10 photographs.
Figure 4: Time-related statistics within the SeaTurtleID2022 dataset: number of encounters per year (left), distribution of all individuals to the total number of observation years, i.e., recurrence of individuals (middle), and number of newly observed identities in each year (right).
Figure 5: Examples of body parts (head, carapace, flippers) segmentation masks.
...and 7 more figures
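The time-related statistics described in the Figure 4 caption can be derived from identity and timestamp annotations alone. The sketch below uses hypothetical toy records and assumes that one encounter means one individual photographed on one day. That definition is an assumption for illustration and may differ from the one used in the paper.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records: (identity, encounter_date), one per photograph.
photos = [
    ("t001", date(2010, 7, 1)),
    ("t001", date(2010, 7, 1)),   # second photo from the same encounter
    ("t001", date(2012, 6, 20)),
    ("t002", date(2012, 8, 3)),
    ("t003", date(2013, 7, 9)),
]

# Assumption: one encounter = one individual on one day.
encounters = set(photos)

# Left panel: number of encounters per year.
encounters_per_year = Counter(d.year for _, d in encounters)

# Collect the distinct years in which each individual was observed.
years_seen = {}
for identity, d in encounters:
    years_seen.setdefault(identity, set()).add(d.year)

# Middle panel: how many individuals were observed in exactly k distinct years.
recurrence = Counter(len(years) for years in years_seen.values())

# Right panel: each individual counts once, in its first observation year.
new_per_year = Counter(min(years) for years in years_seen.values())
```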
