Transfer Learning with Self-Supervised Vision Transformers for Snake Identification

Anthony Miyaguchi, Murilo Gustineli, Austin Fischer, Ryan Lundqvist

TL;DR

This work investigates snake species identification from images, using Meta's DINOv2 self-supervised vision transformers to generate embeddings for the large, diverse SnakeCLEF 2024 dataset. A linear classifier is trained on the embeddings (with optional 2D DCT features from patch tokens) to predict species, forming an end-to-end transfer-learning pipeline. Although the reported Track 1 score of $39.69$ fell short of both the top submission and the baseline, the authors observe meaningful clustering in the CLS embeddings and identify segmentation-based strategies (SAM, OWL-ViT, YOLOv8) as promising avenues for significant performance gains. The study highlights the potential of pre-trained vision models for biodiversity monitoring and provides a foundation for future improvements through task-specific modeling and image segmentation.
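To make the "2D DCT features from patch tokens" idea concrete, here is a minimal, dependency-free sketch. It assumes the patch tokens of one embedding channel have been arranged into a square grid, and that only the lowest-frequency coefficients are kept as features. The grid size, the value of `k`, and the helper names `dct_1d`, `dct_2d`, and `low_freq_features` are illustrative assumptions, not details taken from the paper.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(grid):
    """Separable 2D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in grid]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols[v][u] holds frequency (u, v); transpose back to [u][v] layout.
    return [list(r) for r in zip(*cols)]


def low_freq_features(grid, k):
    """Flatten the top-left k x k block (lowest frequencies) into a vector."""
    coeffs = dct_2d(grid)
    return [coeffs[u][v] for u in range(k) for v in range(k)]


# Hypothetical 4x4 grid standing in for one channel of patch tokens.
grid = [[float(r + c) for c in range(4)] for r in range(4)]
features = low_freq_features(grid, k=2)
print(features)
```

In practice a library routine such as SciPy's `scipy.fft.dctn` would replace the hand-written transform; the pure-Python version is only meant to show which coefficients would be kept.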

Abstract

We present our approach for the SnakeCLEF 2024 competition to predict snake species from images. We explore and use Meta's DINOv2 vision transformer model for feature extraction to tackle species' high variability and visual similarity in a dataset of 182,261 images. We perform exploratory analysis on embeddings to understand their structure, and train a linear classifier on the embeddings to predict species. Despite achieving a score of 39.69, our results show promise for DINOv2 embeddings in snake identification. All code for this project is available at https://github.com/dsgt-kaggle-clef/snakeclef-2024.
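The classification step described above (a linear classifier trained on fixed embeddings) can be sketched as follows. This is a toy softmax regression in plain Python, trained on synthetic 4-dimensional vectors that stand in for DINOv2 embeddings. The data, learning rate, epoch count, and function names are all assumptions made for illustration; they do not reflect the authors' actual training setup.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(W, b, x):
    """Linear scores W @ x + b, one per class."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(b))]


def train_linear_classifier(X, y, num_classes, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, label in zip(X, y):
            probs = softmax(scores(W, b, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                gb[c] += err
                for j in range(dim):
                    gW[c][j] += err * x[j]
        for c in range(num_classes):
            b[c] -= lr * gb[c] / n
            for j in range(dim):
                W[c][j] -= lr * gW[c][j] / n
    return W, b


def predict(W, b, x):
    """Index of the highest-scoring class."""
    s = scores(W, b, x)
    return max(range(len(s)), key=lambda c: s[c])


# Synthetic stand-in for embeddings: three well-separated "species" clusters.
random.seed(0)
centers = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
X, y = [], []
for label, center in enumerate(centers):
    for _ in range(30):
        X.append([c + random.gauss(0.0, 0.3) for c in center])
        y.append(label)

W, b = train_linear_classifier(X, y, num_classes=3)
accuracy = sum(predict(W, b, x) == t for x, t in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

Because the backbone stays frozen in this kind of transfer learning, only `W` and `b` are learned; a real pipeline would typically use a deep-learning framework and a held-out validation split rather than reporting training accuracy.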

Paper Structure (16 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: UMAP projection of DCT and [CLS] embeddings of the top 5 snake species by image count.
  • Figure 2: A selected subset of snake species with unique features relevant to the task. We extracted the [CLS] token embeddings using DINOv2 and created projections via UMAP. The distribution is expected, with species like Agkistrodon piscivorus and Agkistrodon contortrix being represented similarly. Additionally, species that are more biologically polymorphic and differ in appearance regionally, like Lampropeltis triangulum and Morelia spilota, exhibit a wider spread. In contrast, species that appear more uniform, like Bitis gabonica and Micrurus fulvius, are found in distinct clusters.
  • Figure 3: End-to-end pipeline. The downloading module retrieves the training and test images and the metadata file, storing them in a Google Cloud Storage (GCS) bucket. The preprocessing module converts the images to binary data and writes them as parquet files to GCS. In the modeling module, the base DINOv2 model extracts embeddings from the training and test data, and a linear classifier is trained on the training embeddings. During inference, the trained classifier makes predictions on the test embeddings, formatting the results for leaderboard submission.
  • Figure 4: Unsupervised SAM
  • Figure 5: Manually selected SAM segment
  • ...and 2 more figures