Transfer Learning with Self-Supervised Vision Transformers for Snake Identification
Anthony Miyaguchi, Murilo Gustineli, Austin Fischer, Ryan Lundqvist
TL;DR
This work investigates snake species identification from images, using Meta's DINOv2 self-supervised vision transformers to generate embeddings for the large, diverse SnakeCLEF 2024 dataset. A linear classifier is trained on the embeddings (optionally augmented with 2D DCT features computed from patch tokens) to predict species, forming an end-to-end transfer-learning pipeline. Although the reported Track 1 score of 39.69 fell below both the top-scoring submission and the baseline, the authors observe meaningful clustering in the CLS embeddings and identify segmentation-based strategies (SAM, OWL-ViT, YOLOv8) as promising avenues for substantial performance gains. The study highlights the potential of pre-trained vision models for biodiversity monitoring and provides a foundation for future improvements through task-specific modeling and image segmentation.
Abstract
We present our approach to the SnakeCLEF 2024 competition, which asks participants to predict snake species from images. We use Meta's DINOv2 vision transformer for feature extraction to address the high variability and visual similarity of species in a dataset of 182,261 images. We perform exploratory analysis on the embeddings to understand their structure, and train a linear classifier on them to predict species. Although our submission achieved a score of only 39.69, the results show promise for DINOv2 embeddings in snake identification. All code for this project is available at https://github.com/dsgt-kaggle-clef/snakeclef-2024.
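The second stage of the approach, a linear classifier trained on fixed embeddings, can be illustrated with a minimal sketch. This is not the code from the repository: the embeddings here are synthetic random vectors standing in for DINOv2 CLS embeddings, and the class count, embedding dimension, learning rate, and function names (`train_linear_classifier`, `predict`) are all assumptions chosen for illustration. The classifier is plain multinomial logistic regression fitted by gradient descent with NumPy.

```python
import numpy as np


def train_linear_classifier(X, y, num_classes, lr=1.0, epochs=200, seed=0):
    """Fit a single linear layer with softmax cross-entropy (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets, shape (n, num_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the index of the highest-scoring class for each row of X."""
    return np.argmax(X @ W + b, axis=1)


# Synthetic stand-in for precomputed embeddings (assumption: 5 classes, 64 dims).
rng = np.random.default_rng(42)
num_classes, dim, per_class = 5, 64, 40
centers = rng.normal(scale=3.0, size=(num_classes, dim))
y = np.repeat(np.arange(num_classes), per_class)
X = centers[y] + rng.normal(size=(len(y), dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each embedding

W, b = train_linear_classifier(X, y, num_classes)
accuracy = float((predict(X, W, b) == y).mean())
print(f"training accuracy on synthetic embeddings: {accuracy:.2f}")
```

In a real pipeline, `X` would hold embeddings extracted once from the frozen backbone and `y` the species labels, so only the small weight matrix `W` and bias `b` are learned. The synthetic clusters above are far easier to separate than real snake images, so the accuracy printed here says nothing about competition performance.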
