Multi-Label Plant Species Classification with Self-Supervised Vision Transformers

Murilo Gustineli, Anthony Miyaguchi, Ian Stalter

TL;DR

This work tackles large-scale multi-label plant species classification by leveraging self-supervised DINOv2 Vision Transformers as generalized feature extractors. It combines base and fine-tuned embeddings with linear classifiers and a tile-based grid inference pipeline to handle high-resolution vegetation plots while addressing dataset scale via Spark and Parquet preprocessing. The approach demonstrates that fine-tuned DINOv2 embeddings, especially when used with grid-based argmax inference, yield competitive Macro F1 and Micro F1 scores, highlighting the value of task-specific fine-tuning and spatial tiling for multi-label recognition. The proposed pipeline offers a scalable, transferable framework for biodiversity image analysis that can extend to other large ecological datasets, with code available for reproduction.

Abstract

We present a transfer learning approach using a self-supervised Vision Transformer (DINOv2) for the PlantCLEF 2024 competition, focusing on multi-label plant species classification. Our method leverages both base and fine-tuned DINOv2 models to extract generalized feature embeddings. We train classifiers on these embeddings to predict multiple plant species within a single image. To address the computational challenges of the large-scale dataset, we employ Spark for distributed data processing, ensuring efficient memory management and processing across a cluster of workers. Our data processing pipeline transforms images into grids of tiles, classifies each tile, and aggregates these predictions into a consolidated set of probabilities. Our results demonstrate the efficacy of combining transfer learning with distributed data processing for multi-label image classification tasks. Our code is available at https://github.com/dsgt-kaggle-clef/plantclef-2024.
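
The tile-level aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact aggregation rule, so the two helpers below (`aggregate_tile_probs` and `argmax_per_tile`, both hypothetical names) show two plausible options: keeping the maximum probability per species across tiles, and keeping only the top-1 species of each tile, in the spirit of the "grid-based argmax inference" mentioned in the TL;DR. The probabilities in `toy_tiles` are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Set

# One dict per tile, mapping species name -> predicted probability.
TileProbs = Dict[str, float]


def aggregate_tile_probs(tile_probs: List[TileProbs]) -> Dict[str, float]:
    """Assumed rule: keep, for each species, the highest probability seen in any tile."""
    merged: Dict[str, float] = {}
    for probs in tile_probs:
        for species, p in probs.items():
            merged[species] = max(p, merged.get(species, 0.0))
    return merged


def argmax_per_tile(tile_probs: List[TileProbs]) -> Set[str]:
    """Assumed rule: keep only the most probable species of each non-empty tile."""
    return {max(probs, key=probs.get) for probs in tile_probs if probs}


# Toy values for illustration only (not taken from the paper).
toy_tiles = [
    {"Poa alpina L.": 0.7, "Thymus nervosus J.Gay ex Willk.": 0.2},
    {"Poa alpina L.": 0.4, "Saxifraga moschata Wulfen": 0.5},
]

image_probs = aggregate_tile_probs(toy_tiles)   # one probability per species for the image
image_labels = argmax_per_tile(toy_tiles)       # one predicted species per tile, deduplicated
```

Taking the maximum favors species that are confidently detected in a single tile, which suits small plants occupying only part of a plot; averaging across tiles would be an equally simple alternative that dampens isolated false positives.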

Paper Structure (13 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of our proposed transfer learning method. In the modeling pipeline, we extract the DCT coefficient and [CLS] token embeddings from the single-label cropped and resized images using the base or fine-tuned DINOv2 model, and train a classifier on the embeddings. In the inference pipeline, DINOv2 extracts the [CLS] token embedding from each tile or from the full image, and the trained classifier then performs inference to obtain the output species labels.
  • Figure 2: End-to-end pipeline of our proposed solution. The downloading module retrieves the training and test images, along with metadata, and stores them on Google Cloud Storage (GCS). The preprocessing module converts the images to binary data, crops and resizes them to $\mathcal{R}^{128\times 128}$ dimensions, and writes them as Parquet files to GCS. The modeling module extracts embeddings using base and fine-tuned DINOv2 models and trains a linear classifier on the training embeddings. During inference, the trained classifier makes predictions on the test embeddings, and the results are formatted for leaderboard submission.
  • Figure 3: Comparison of original images with their cropped and resized $\mathcal{R}^{128 \times 128}$ square counterparts. The original images have a minimum resolution of 800 pixels on the longest side, allowing for the use of high-resolution classification models and potentially improving the prediction of small plants in large vegetative plots.
  • Figure 4: Comparison of full-image prediction and grid-based image prediction. The left plot shows a typical vegetative plot from the test set, where a botanist recorded 8 species: Cardamine resedifolia L., Festuca airoides Lam., Pilosella breviscapa (DC.) Soják, Lotus alpinus (Ser.) Schleich. ex Ramond, Poa alpina L., Saxifraga moschata Wulfen, Scorzoneroides pyrenaica (Gouan) Holub, and Thymus nervosus J.Gay ex Willk. The right plot illustrates the same image divided into a $3\times 3$ grid, demonstrating the grid-based approach for species classification by processing each tile independently.
  • Figure 5: UMAP projections of the top 5 plant species with the highest number of images. The fine-tuned model's embeddings exhibit better spatial separation, highlighting their effectiveness as feature representations.
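
The cropping and tiling geometry described in Figures 3 and 4 can be sketched as follows. This is a minimal illustration and not the authors' implementation: `center_square_box` and `grid_boxes` are hypothetical helper names, the choice of a centered crop is an assumption (the captions only say the images are cropped and resized to squares), and the image sizes in the example are arbitrary. Boxes are expressed as `(left, top, right, bottom)` pixel coordinates, the same ordering that Pillow's `Image.crop` accepts.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def center_square_box(width: int, height: int) -> Box:
    """Largest centered square inside a width x height image (assumed crop strategy)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def grid_boxes(width: int, height: int, grid_size: int = 3) -> List[Box]:
    """Split an image into grid_size x grid_size tiles, row by row.

    Integer division on cumulative edges keeps the tiles contiguous and
    covers the whole image even when the size is not divisible by grid_size.
    """
    boxes: List[Box] = []
    for row in range(grid_size):
        top = row * height // grid_size
        bottom = (row + 1) * height // grid_size
        for col in range(grid_size):
            left = col * width // grid_size
            right = (col + 1) * width // grid_size
            boxes.append((left, top, right, bottom))
    return boxes


# Arbitrary example sizes, chosen only to show the geometry.
crop = center_square_box(800, 600)
tiles = grid_boxes(900, 600, grid_size=3)
```

Each tile box can then be cropped out, resized to the model's input size, and passed through the embedding model and classifier independently, which is the per-tile processing that Figure 4 illustrates with a $3\times 3$ grid.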