ActiveSSF: An Active-Learning-Guided Self-Supervised Framework for Long-Tailed Megakaryocyte Classification

Linghao Zhuang, Ying Zhang, Gege Yuan, Xingyue Zhao, Zhiping Jiang

TL;DR

ActiveSSF tackles megakaryocyte classification under background noise, long-tail subtype distributions, and morphological variability by coupling clinical-prior cell-region filtering with adaptive, prototype-guided sample selection in self-supervised pretraining. The two-stage pipeline extracts informative cellular regions and builds robust prototypes from labeled data to steer unlabeled data selection, using dynamic density-aware thresholds to emphasize rare subtypes. Across eleven megakaryocyte subtypes on a clinical dataset, ActiveSSF yields state-of-the-art results and substantial gains for rare classes, demonstrating improved diagnostic potential for myelodysplastic syndrome. The framework's integration of region filtering, prototype clustering, and adaptive sampling offers a practical path toward scalable, accurate automated blood-cell analysis in clinical settings.

Abstract

Precise classification of megakaryocytes is crucial for diagnosing myelodysplastic syndromes. Although self-supervised learning has shown promise in medical image analysis, its application to classifying megakaryocytes in stained slides faces three main challenges: (1) pervasive background noise that obscures cellular details, (2) a long-tailed distribution that limits data for rare subtypes, and (3) complex morphological variations leading to high intra-class variability. To address these issues, we propose the ActiveSSF framework, which integrates active learning with self-supervised pretraining. Specifically, our approach employs Gaussian filtering combined with K-means clustering and HSV analysis (augmented by clinical prior knowledge) for accurate region-of-interest extraction; an adaptive sample selection mechanism that dynamically adjusts similarity thresholds to mitigate class imbalance; and prototype clustering on labeled samples to overcome morphological complexity. Experimental results on clinical megakaryocyte datasets demonstrate that ActiveSSF not only achieves state-of-the-art performance but also significantly improves recognition accuracy for rare subtypes. Moreover, the integration of these advanced techniques further underscores the practical potential of ActiveSSF in clinical settings.
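The region-of-interest step described in the abstract (Gaussian filtering, K-means clustering, HSV analysis) can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation. It rests on several assumptions: a fixed 3x3 Gaussian kernel, K = 2 clusters, and the rule that stained cells are more saturated than the slide background. That rule is a stand-in for the clinical prior, which the abstract does not specify. The function names `gaussian_blur_3x3`, `two_means_1d`, and `cell_mask` are hypothetical.

```python
import colorsys


def gaussian_blur_3x3(img):
    """Blur a 2D grid of floats with a 3x3 Gaussian kernel (edges replicated)."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to image border
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out


def two_means_1d(values, iters=20):
    """K-means with K=2 on scalars; returns (low_center, high_center)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        if not low or not high:  # degenerate case: only one cluster present
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return lo, hi


def cell_mask(rgb_image):
    """Boolean mask of pixels in the high-saturation cluster (assumed to be cells)."""
    # HSV analysis: keep only the saturation channel, scaled to [0, 1].
    sat = [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] for (r, g, b) in row]
        for row in rgb_image
    ]
    blurred = gaussian_blur_3x3(sat)  # suppress pixel-level background noise
    lo, hi = two_means_1d([v for row in blurred for v in row])
    mid = (lo + hi) / 2.0  # decision boundary between the two cluster centers
    return [[v > mid for v in row] for row in blurred]
```

Calling `cell_mask` on an image given as rows of `(r, g, b)` tuples returns a same-shaped grid of booleans. A practical pipeline would more likely rely on an image-processing library and cluster in a richer color space; the sketch only shows how the three named operations fit together.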

Paper Structure

This paper contains 15 sections, 9 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of our megakaryocyte dataset. Left: Representative images of different megakaryocyte subtypes. Right: Distribution of megakaryocyte subtypes showing the inherent class imbalance.
  • Figure 2: We present a two-stage active learning framework for guiding self-supervised pretraining. Stage 1 (Cell Region Filtering) applies Gaussian blur and K-means clustering to isolate cellular regions from background noise. Stage 2 (Active Sample Selection) extracts features via ResNet (He et al., 2016), clusters them into K prototypes using K-means (MacQueen, 1967), and establishes dynamic thresholds inversely proportional to cluster size, thereby accommodating rare subtypes with lower thresholds. The framework selects samples based on their distance to cluster centers, retaining only those within their respective thresholds for self-supervised pretraining.
  • Figure 3: Comparison of classification performance across all categories between MAE and MAE+ActiveSSF: (a) MAE Confusion Matrix, (b) MAE+ActiveSSF Confusion Matrix, and (c) Per-Class PR-AUC Score Comparison.
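
The Stage 2 selection rule in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: features are plain 2D tuples rather than ResNet embeddings, K-means is initialized from the first K labeled points for determinism, and the threshold is read as a distance radius `base_radius * mean_size / cluster_size`. Under that reading, small (rare) clusters accept samples that lie farther from their prototype, which in similarity terms corresponds to a lower similarity cutoff. The exact threshold formula is not given in the text above, and the names `kmeans`, `dynamic_radii`, `select`, and `base_radius` are hypothetical.

```python
import math


def assign(point, centers):
    """Return (index, distance) of the center nearest to point."""
    dists = [math.dist(point, c) for c in centers]
    k = min(range(len(centers)), key=dists.__getitem__)
    return k, dists[k]


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means; initial centers are the first k points."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centers)[0]].append(p)
        # Recompute each center as the mean of its group (keep it if empty).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Count members under the final centers.
    sizes = [0] * k
    for p in points:
        sizes[assign(p, centers)[0]] += 1
    return centers, sizes


def dynamic_radii(sizes, base_radius=1.0):
    """Radius inversely proportional to cluster size (assumed formula)."""
    mean_size = sum(sizes) / len(sizes)
    return [base_radius * mean_size / s if s else 0.0 for s in sizes]


def select(unlabeled, centers, radii):
    """Keep unlabeled samples that fall within the radius of their nearest prototype."""
    kept = []
    for p in unlabeled:
        k, d = assign(p, centers)
        if d <= radii[k]:
            kept.append(p)
    return kept


# Labeled features: six samples of a common subtype, two of a rare one.
labeled = [(0, 0), (10, 10), (0, 2), (2, 0), (2, 2), (0, 1), (1, 0), (10, 12)]
centers, sizes = kmeans(labeled, k=2)
radii = dynamic_radii(sizes, base_radius=1.5)
kept = select([(1, 1), (3, 3), (12, 12), (5, 5)], centers, radii)
```

In this toy run the rare cluster receives the larger radius, so `(12, 12)` is retained even though it is farther from its prototype than the rejected `(3, 3)` is from the common one. The retained samples would then form the pool used for self-supervised pretraining.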