
SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated learning

Md Anwar Hossen, Nathan R. Tallent, Luanzheng Guo, Ali Jannesary

Abstract

Scientific discovery increasingly requires learning on federated datasets, fed by streams from high-resolution instruments, that have extreme class imbalance. Current ML approaches either require impractical data aggregation or fail due to class imbalance. Existing coreset selection methods rely on local heuristics, making them unaware of the global data landscape and prone to sub-optimal and non-representative pruning. To overcome these challenges, we introduce SCOPE (Semantic Coreset using Orthogonal Projection Embeddings for Federated learning), a coreset framework for federated data that filters anomalies and adaptively prunes redundant data to mitigate long-tail skew. By analyzing the latent space distribution, we score each data point using a representation score that measures the reliability of core class features, a diversity score that quantifies the novelty of orthogonal residuals, and a boundary proximity score that indicates similarity to competing classes. Unlike prior methods, SCOPE shares only scalar metrics with a federated server to construct a global consensus, ensuring communication efficiency. Guided by the global consensus, SCOPE dynamically filters local noise and discards redundant samples to counteract global feature skews. Extensive experiments demonstrate that SCOPE yields competitive global accuracy and robust convergence, all while achieving exceptional efficiency with a 128x to 512x reduction in uplink bandwidth, a 7.72x wall-clock acceleration and reduced FLOP and VRAM footprints for local coreset selection.
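
The three per-sample scores described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes unit-normalized embeddings, one prototype direction per class (the `prototypes` array is hypothetical), and at least two classes. It takes the representation score as the cosine projection onto the sample's own class direction, the diversity score as the norm of the orthogonal residual, and the boundary proximity score as the largest cosine similarity to any competing class. The paper's exact definitions may differ.

```python
import numpy as np

def semantic_scores(z, label, prototypes):
    """Illustrative scores for one sample (assumed definitions, not the paper's).

    z          : (d,) embedding of the sample
    label      : integer class index of the sample
    prototypes : (C, d) one direction per class, C >= 2 (hypothetical input)
    """
    z = np.asarray(z, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    # Normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z)
    prototypes = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    c = prototypes[label]
    rs = float(z @ c)                      # projection onto the own-class direction
    residual = z - rs * c                  # component orthogonal to that direction
    ds = float(np.linalg.norm(residual))   # size of the orthogonal residual
    others = np.delete(prototypes, label, axis=0)
    s_neg = float(np.max(others @ z))      # closest competing class
    return rs, ds, s_neg

if __name__ == "__main__":
    protos = np.eye(3)
    print(semantic_scores([3.0, 4.0, 0.0], 0, protos))
```

In this toy setup each client would transmit only the three scalars per sample or per class, rather than the `d`-dimensional embedding, which is the communication pattern the abstract describes.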

Paper Structure

This paper contains 33 sections, 3 theorems, 23 equations, 6 figures, 12 tables, 1 algorithm.

Key Result

Lemma 1

Given a raw data gradient bias $\beta_{noise}$, removing the top $p_l$ fraction of samples sorted by the Semantic Anomaly Score $AS_i$ bounds the residual approximation error $\epsilon^2 \ll \beta_{noise}^2$. By directly targeting semantic contradictions, $\nabla \tilde{F}_k[w]$ becomes a superior e
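
The pruning step in Lemma 1 (drop the top $p_l$ fraction of samples ranked by anomaly score) can be sketched as follows. The excerpt does not define $AS_i$, so the sketch assumes $AS_i = S_{\text{neg},i} - RS_i$, which is positive exactly in the Anomaly Zone shown in Figure 3. Both that definition and the function name `prune_anomalies` are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def prune_anomalies(rs, s_neg, p_l):
    """Return sorted indices of samples kept after anomaly pruning.

    rs, s_neg : per-sample representation and boundary-proximity scores
    p_l       : fraction of samples to remove, in [0, 1]
    Assumption: anomaly score AS_i = s_neg_i - rs_i (not given in the source).
    """
    rs = np.asarray(rs, dtype=float)
    s_neg = np.asarray(s_neg, dtype=float)
    if not 0.0 <= p_l <= 1.0:
        raise ValueError("p_l must lie in [0, 1]")

    anomaly = s_neg - rs
    n_drop = int(np.floor(p_l * len(anomaly)))
    # Stable sort on the negated score puts the most anomalous samples first.
    order = np.argsort(-anomaly, kind="stable")
    return np.sort(order[n_drop:])

if __name__ == "__main__":
    kept = prune_anomalies([0.9, 0.2, 0.7, 0.4], [0.1, 0.8, 0.3, 0.5], p_l=0.25)
    print(kept)
```

With four samples and `p_l=0.25`, one sample is dropped: the one whose competing-class similarity most exceeds its own-class score.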

Figures (6)

  • Figure 1: (a) A system managing massive data volumes with highly skewed, non-IID, and imbalanced data distributions across diverse edge nodes. (b) Mean Top-1 accuracy ($\%$) of federated data pruning methods, averaged across pruning rates $\{0.1, 0.3, 0.5, 0.7, 0.9\}$. Error bars indicate the standard deviation across pruning rates and so capture pruning-rate sensitivity. Baselines show wide error bars and are highly sensitive to the pruning rate, whereas SCOPE (ours) shows a relatively narrow bar and is more robust and predictable. The dashed line marks training accuracy on the full local dataset.
  • Figure 2: Overview of the SCOPE framework. (a) Clients extract $RS$, $DS$, and $S_{\text{neg}}$ scalars using a zero-shot MobileCLIP-S2 projection and send class-centered representations; (b) the server aggregates these into a Global Profile. Guided by this framework, clients implement a two-stage pruning mechanism: (c) a Consensus Filter to eliminate semantic anomalies, and (d) Dynamic Balancing to discard redundant data by synchronizing local boundary complexity with the global consensus. (e) The result is a refined, balanced coreset for accelerated federated training.
  • Figure 3: Evaluation of the SCOPE pruning process. (a) The stacked bar chart reveals the distinct roles of the two filters: the Balancing Filter primarily targets redundancy in the majority (head) classes while preserving the tail classes, whereas the Consensus Filter removes semantic outliers across the spectrum. (b) Scatter plot of Representation Score ($RS$) versus Boundary Proximity ($S_{\text{neg}}$). The dashed line marks the decision boundary $RS = S_{\text{neg}}$. The shaded Anomaly Zone ($S_{\text{neg}} > RS$) highlights samples where negative class proximity exceeds true class representation.
  • Figure 4: Impact of $\beta$ across different pruning severities $p_f$. In highly constrained scenarios, such as aggressive pruning ($p_f=0.9$) or severe imbalance (IR=10), SCOPE exhibits a distinct inverted-U trajectory, consistently peaking at the balanced threshold of $\beta=0.5$.
  • Figure 5: Semantic Integrity for CIFAR-10. (a) The process identifies sparse outliers and dense redundant samples near cluster centroids. (b) The final coreset retains high-quality samples (circles) that preserve the geometric support of the global class centers (stars), sharpening class separability.
  • ...and 1 more figure
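
Steps (a) and (b) in the Figure 2 caption, where clients report scalar statistics and the server merges them into a Global Profile, can be sketched as a count-weighted average. The report layout used here (per-class tuples of sample count, mean $RS$, and mean $S_{\text{neg}}$) and the function name `aggregate_global_profile` are assumptions made for illustration; the excerpt does not specify what the Global Profile contains.

```python
from collections import defaultdict

def aggregate_global_profile(client_reports):
    """Merge per-client, per-class scalar summaries into one global summary.

    client_reports : list of dicts {class_id: (count, mean_rs, mean_s_neg)}
                     (hypothetical report format, one dict per client)
    Returns {class_id: {"count": int, "rs": float, "s_neg": float}} where the
    means are weighted by each client's sample count for that class.
    """
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for report in client_reports:
        for cls, (count, mean_rs, mean_s_neg) in report.items():
            acc = totals[cls]
            acc[0] += count
            acc[1] += count * mean_rs
            acc[2] += count * mean_s_neg

    return {
        cls: {"count": n, "rs": sum_rs / n, "s_neg": sum_s_neg / n}
        for cls, (n, sum_rs, sum_s_neg) in totals.items()
        if n > 0
    }

if __name__ == "__main__":
    reports = [
        {0: (10, 0.8, 0.2), 1: (2, 0.5, 0.4)},  # client holding both classes
        {0: (30, 0.6, 0.4)},                    # client holding only class 0
    ]
    print(aggregate_global_profile(reports))
```

Weighting by count means a client holding only a handful of tail-class samples does not skew the global mean, while the global count per class exposes the long-tail imbalance that the balancing stage is meant to address.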

Theorems & Definitions (3)

  • Lemma 1: Gradient Bias Reduction via Anomaly Pruning
  • Lemma 2: Client Drift Reduction via Boundary Alignment
  • Theorem 1: Nonconvex Convergence Guarantee