
Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification

Mingkun Tan, Xilu Wang, Michael Kloster, Tim W. Nattkemper

Abstract

Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate purely partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable setting, we propose PreDi, a partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions, namely class Prevalence and class-set size Disparity, enabling separate analysis of their effects. Guided by the resulting insights, we further propose PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios. Extensive experiments show that SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings. Pronounced heterogeneity in unlabeled data volume is associated with improved representation pre-training, whereas under label-space heterogeneity, prevalence dominates performance and disparity has a smaller effect. PreP-WFL effectively mitigates the degradation caused by low prevalence, with gains increasing as prevalence decreases. These findings provide a mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems.
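
To make the two PreDi axes concrete, here is a minimal measurement sketch in Python. It assumes one plausible reading of the targets $\bar{\rho}^*$ and $\sigma^*$ used later (see Figure 5): the prevalence of a class is the number of clients whose label set contains it, and disparity is the spread of the class-set sizes $|\mathcal{C}_k|$ across clients. The helper name predi_metrics is hypothetical; this is not the paper's partitioning code.

```python
from statistics import mean, pstdev

def predi_metrics(client_classes: dict[str, set[int]]) -> tuple[float, float]:
    """Measure the two PreDi axes for a client -> class-set assignment.

    Prevalence rho_c of class c: how many clients hold c in their label set
    (averaged over classes to get bar-rho). Disparity sigma: spread of the
    class-set sizes |C_k| across clients. Both readings are assumptions
    consistent with the bar-rho* and sigma* targets in Figure 5.
    """
    all_classes = set().union(*client_classes.values())
    prevalence = {c: sum(c in cs for cs in client_classes.values())
                  for c in all_classes}
    mean_prevalence = mean(prevalence.values())                    # bar-rho
    disparity = pstdev(len(cs) for cs in client_classes.values())  # sigma
    return mean_prevalence, disparity

# Four sites with partially overlapping class sets:
clients = {
    "site_1": {0, 1, 2, 3},
    "site_2": {0, 1, 2},
    "site_3": {0, 1, 4},
    "site_4": {0, 5},
}
rho_bar, sigma = predi_metrics(clients)
print(f"mean prevalence = {rho_bar:.2f}, class-set disparity = {sigma:.2f}")
```

Under this reading, lowering $\bar{\rho}$ makes classes more site-specific (fewer clients share each taxon), while raising $\sigma$ lets some clients carry many more classes than others; PreDi varies the two independently.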


Paper Structure

This paper contains 28 sections, 10 equations, 9 figures, 11 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Overview of the self-supervised federated learning framework with PreP-WFL. In the pre-training stage, masked image reconstruction is adopted as the self-supervised task: at each communication round $t$, every client $k\in\{1,\dots,K\}$ updates its local encoder $f_{\theta_k}$ and decoder $g_{\phi_k}$ on the unlabeled dataset $\mathcal{D}^u_k$, and uploads the parameters $(\theta^t_k,\phi^t_k)$ to the server. The server performs FedAvg to obtain a global encoder-decoder pair $(f_{\theta^{(*)}},g_{\phi^{(*)}})$ and broadcasts it to all clients. In the fine-tuning stage, the final global encoder $f_{\theta^{(*)}}$ initializes the local encoders, and a classifier head $h_{\psi_k}$ is attached at each client and trained on its labeled dataset $\mathcal{D}^\ell_k$. The PreP-weight module (red box) performs a one-shot aggregation of the clients’ label sets $\{\mathcal{C}_k\}$ (label IDs only, no images) to compute the global class prevalence $\rho_c$ and the corresponding weights $w_c$. These weights are then broadcast to all clients and incorporated into a weighted cross-entropy loss during local fine-tuning, thereby emphasizing rare, site-specific taxa in the federated model (see the code sketch after this figure list).
  • Figure 2: Overview of the dataset reconstruction pipeline. From the original dataset, we first derive the filtered labeled set $\mathcal{D}^{f}$, which is partitioned into training and test sets $\mathcal{D}^{\mathrm{train}}$ and $\mathcal{D}^{\mathrm{test}}$ ($80/20$). The remaining taxa form $\mathcal{D}^r$. From $\mathcal{D}^{\mathrm{train}}$, we sample balanced (50 images per class) labeled subsets $\{\mathcal{D}^{\ell}_k\}_{k=1}^{4}$ to serve as the labeled dataset on each client for downstream fine-tuning. The unlabeled pool $\mathcal{D}^{u}$ is constructed by combining $\mathcal{D}^{\mathrm{train}}$ with $\mathcal{D}^r$ and discarding all labels.
  • Figure 3: Heterogeneity in unlabeled data volume. Bars show the number of unlabeled samples per client for the IID Split$^{u}_{\mathrm{IID}}$ and the four non-IID partitions Split$^{u}_{1}$–Split$^{u}_{4}$ (corresponding to $\alpha\!\in\!\{1.0,0.5,0.2,0.1\}$). See also Table \ref{tab:pretrain_splits}.
  • Figure 4: Illustration of class prevalence and class-set size disparity. The x-axis indicates prevalence (low to high), the y-axis indicates disparity (low to high), and each colored circle denotes a class (A–D).
  • Figure 5: Heatmap of the number of classes per client under different labeled-data splits. The x-axis enumerates the IID split $\text{Split}^{\ell}_{\mathrm{IID}}$ and PreDi non-IID splits $\text{Split}^{\ell}_{\bar{\rho}^*,\sigma^*}$ obtained by varying $\bar{\rho}^* \in \{3.5, 3.0, 2.5, 2.0, 1.5\}$ and $\sigma^* \in \{0.0, 1.0, 2.0, 3.0\}$. The y-axis lists the clients, and each cell indicates the number of classes assigned to that client under the corresponding split.
  • ...and 4 more figures
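
The PreP-weight module in Figure 1 is concrete enough to sketch in code. The snippet below is a minimal illustration, not the paper's implementation: it assumes PyTorch, and the inverse-prevalence form of the weights $w_c$ is our assumption, since this excerpt only states that rarer, lower-prevalence classes should receive larger weights. The function name prep_weights is hypothetical.

```python
import torch
import torch.nn as nn

def prep_weights(client_label_sets: list[set[int]], num_classes: int) -> torch.Tensor:
    """One-shot PreP-weight aggregation (red box in Figure 1).

    Each client ships only its label-ID set C_k, never images. The server
    counts the global prevalence rho_c of every class (how many clients
    hold it) and derives per-class weights w_c that grow as prevalence
    shrinks. The inverse-prevalence formula is an illustrative assumption.
    """
    num_clients = len(client_label_sets)
    rho = torch.tensor(
        [sum(c in s for s in client_label_sets) for c in range(num_classes)],
        dtype=torch.float,
    )
    w = num_clients / rho.clamp(min=1.0)  # rarer taxa -> larger weight
    return w / w.mean()                   # normalize so the mean weight is 1

# Server side: one-shot aggregation over K = 4 clients, then broadcast w_c.
label_sets = [{0, 1, 2, 3}, {0, 1, 2}, {0, 1, 4}, {0, 5}]
w_c = prep_weights(label_sets, num_classes=6)

# Client side: the broadcast weights enter the local fine-tuning loss.
criterion = nn.CrossEntropyLoss(weight=w_c)
logits = torch.randn(8, 6)               # dummy classifier-head outputs
labels = torch.randint(0, 6, (8,))
loss = criterion(logits, labels)
```

Normalizing the weights to unit mean keeps the loss scale comparable across splits, and because only label IDs cross the network, the aggregation costs a single lightweight communication round.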