Multi-Label Plant Species Classification with Self-Supervised Vision Transformers
Murilo Gustineli, Anthony Miyaguchi, Ian Stalter
TL;DR
This work tackles large-scale multi-label plant species classification by using self-supervised DINOv2 Vision Transformers as generalized feature extractors. It combines base and fine-tuned embeddings with linear classifiers and a tile-based grid inference pipeline to handle high-resolution vegetation plots, and it addresses dataset scale through Spark and Parquet preprocessing. The approach shows that fine-tuned DINOv2 embeddings, especially when used with grid-based argmax inference, yield competitive Macro F1 and Micro F1 scores, which highlights the value of task-specific fine-tuning and spatial tiling for multi-label recognition. The proposed pipeline offers a scalable, transferable framework for biodiversity image analysis that can extend to other large ecological datasets, with code available for reproduction.
Abstract
We present a transfer learning approach using a self-supervised Vision Transformer (DINOv2) for the PlantCLEF 2024 competition, focusing on multi-label plant species classification. Our method leverages both base and fine-tuned DINOv2 models to extract generalized feature embeddings. We train classifiers on these embeddings to predict multiple plant species within a single image. To address the computational challenges of the large-scale dataset, we employ Spark for distributed data processing, ensuring efficient memory management and processing across a cluster of workers. Our data processing pipeline transforms each image into a grid of tiles, classifies each tile, and aggregates the tile predictions into a consolidated set of probabilities. Our results demonstrate the efficacy of combining transfer learning with advanced data processing techniques for multi-label image classification tasks. Our code is available at https://github.com/dsgt-kaggle-clef/plantclef-2024.
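
The tile-and-aggregate step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`split_into_grid`, `aggregate_max`, `argmax_species`), the grid size, and the hand-written per-tile probabilities are all hypothetical, and taking the per-species maximum across tiles is only one assumed way to consolidate tile predictions. In the actual pipeline, each tile would be embedded with DINOv2 and scored by a trained classifier rather than assigned made-up probabilities.

```python
import numpy as np


def split_into_grid(image: np.ndarray, grid_size: int) -> list:
    """Split an H x W x C image into grid_size x grid_size tiles (row-major).

    Pixels left over when H or W is not divisible by grid_size are dropped.
    """
    height, width = image.shape[:2]
    tile_h, tile_w = height // grid_size, width // grid_size
    tiles = []
    for row in range(grid_size):
        for col in range(grid_size):
            tiles.append(
                image[row * tile_h:(row + 1) * tile_h,
                      col * tile_w:(col + 1) * tile_w]
            )
    return tiles


def aggregate_max(tile_probs: np.ndarray) -> np.ndarray:
    """Consolidate (n_tiles, n_species) probabilities into one vector.

    Assumption: keep the highest probability each species reaches in any tile.
    """
    return tile_probs.max(axis=0)


def argmax_species(tile_probs: np.ndarray) -> list:
    """Return the sorted set of species that are the top prediction of some tile."""
    return sorted({int(i) for i in tile_probs.argmax(axis=1)})


# Toy input: a blank 6 x 6 RGB image split into a 2 x 2 grid (4 tiles).
image = np.zeros((6, 6, 3), dtype=np.uint8)
tiles = split_into_grid(image, grid_size=2)

# Made-up classifier outputs: one row per tile, one column per species.
tile_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])

image_probs = aggregate_max(tile_probs)        # one probability per species
predicted_species = argmax_species(tile_probs)  # species indices for the image
```

Tiling lets a classifier that assigns one dominant label per tile still produce a multi-label prediction for the whole plot, since different tiles can vote for different species.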
