Self-Supervised Pretraining for Fine-Grained Plankton Recognition
Joona Kareinen, Tuomas Eerola, Kaisa Kraft, Lasse Lensu, Sanna Suikkanen, Heikki Kälviäinen
TL;DR
This work tackles fine-grained plankton recognition under dataset shifts caused by diverse imaging instruments and evolving taxonomies. It applies Masked Autoencoder (MAE) self-supervised pretraining to a large, heterogeneous plankton dataset, followed by supervised fine-tuning with limited labeled data. The study compares three pretraining sources: ImageNet, diverse plankton data without target-domain images, and diverse plankton data that includes them. It shows that domain-specific self-supervised learning (SSL) markedly improves performance over ImageNet pretraining in low-label regimes, with a further gain when unlabeled target data is available during pretraining. The contributions include the first MAE application to plankton data, a thorough evaluation of pretraining strategies, and publicly available pretrained models, which together reduce labeling requirements for ecological monitoring and enable better cross-dataset generalization.
Abstract
Plankton recognition is an important computer vision problem due to plankton's essential role in ocean food webs and carbon capture, which creates a need for species-level monitoring. However, the task is challenging due to its fine-grained nature and dataset shifts caused by different imaging instruments and varying species distributions. As new plankton image datasets are collected at an increasing pace, there is a need for general plankton recognition models that require minimal expert effort for data labeling. In this work, we study large-scale self-supervised pretraining for fine-grained plankton recognition. We first use masked autoencoding and a large volume of diverse plankton image data to pretrain a general-purpose plankton image encoder. We then fine-tune this encoder to obtain accurate plankton recognition models for new datasets with a very limited number of labeled training images. Our experiments show that self-supervised pretraining with diverse plankton data clearly increases plankton recognition accuracy compared to standard ImageNet pretraining when the amount of training data is limited. Moreover, the accuracy can be further improved when unlabeled target data is available and used during pretraining.
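To make the masked autoencoding step concrete, the sketch below illustrates the random patch masking that MAE pretraining relies on: most image patches are hidden, the encoder sees only the visible ones, and a decoder is trained to reconstruct the hidden ones. This is an illustration under stated assumptions, not the configuration used in this work. The function name random_patch_mask is hypothetical, the 75% mask ratio is the default from the original MAE paper, and the 224x224 input with 16x16 patches is a common Vision Transformer setting; none of these values are stated in the abstract.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, MAE-style.

    Hypothetical helper for illustration only. The default mask ratio
    (0.75) is an assumption taken from the original MAE paper, not a
    value reported in this abstract.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # uniform random permutation of patch indices
    num_keep = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(indices[:num_keep])  # patches fed to the encoder
    masked = sorted(indices[num_keep:])   # patches the decoder reconstructs
    return visible, masked


# Assumed setting: a 224x224 image cut into 16x16 patches gives
# 14 * 14 = 196 patches in total.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 visible patches, 147 masked patches
```

Because the encoder processes only the visible quarter of the patches in this setting, pretraining on a large unlabeled image collection is comparatively cheap. After pretraining, the decoder is discarded and the encoder is fine-tuned on the labeled target images.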
