CHASe: Client Heterogeneity-Aware Data Selection for Effective Federated Active Learning

Jun Zhang, Jue Wang, Huan Li, Zhongle Xie, Ke Chen, Lidan Shou

TL;DR

This work tackles federated active learning under client heterogeneity by introducing CHASe, a framework that prioritizes annotation for unlabeled samples with high epistemic variation (EV). It combines EV-based data selection, a boundary-calibration alignment loss, and a data-efficient mechanism (FAmS) to improve both effectiveness and efficiency in non-IID federations. Extensive experiments across image and text datasets, with varied federation settings, show CHASe consistently outperforms strong baselines and ablations, while reducing computational costs. The approach offers practical implications for scalable, privacy-preserving active learning in cross-client environments and provides a solid foundation for future exploration of EV-guided data selection and boundary calibration in federated learning.

Abstract

Active learning (AL) reduces human annotation costs for machine learning systems by strategically selecting the most informative unlabeled data for annotation, but performing it individually may still be insufficient due to restricted data diversity and annotation budget. Federated Active Learning (FAL) addresses this by facilitating collaborative data selection and model training, while preserving the confidentiality of raw data samples. Yet, existing FAL methods fail to account for the heterogeneity of data distribution across clients and the associated fluctuations in global and local model parameters, adversely affecting model accuracy. To overcome these challenges, we propose CHASe (Client Heterogeneity-Aware Data Selection), specifically designed for FAL. CHASe focuses on identifying those unlabeled samples with high epistemic variations (EVs), which notably oscillate around the decision boundaries during training. To achieve both effectiveness and efficiency, CHASe encompasses techniques for 1) tracking EVs by analyzing inference inconsistencies across training epochs, 2) calibrating decision boundaries of inaccurate models with a new alignment loss, and 3) enhancing data selection efficiency via a data freeze and awaken mechanism with subset sampling. Experiments show that CHASe surpasses various established baselines in terms of effectiveness and efficiency, validated across diverse datasets, model complexities, and heterogeneous federation settings.

Paper Structure

This paper contains 45 sections, 14 equations, 18 figures, 7 tables, 2 algorithms.

Figures (18)

  • Figure 1: Visualization of samples' epistemic variation in FAL.
  • Figure 2: The workflow and integrated techniques of CHASe.
  • Figure 3: Example of quantifying the EV of an unlabeled sample. For historical inference results ['dog', 'cat', 'cat', 'zebra', 'cat'], $\boldsymbol{V}=[0,1,0,1,1]$ and EV is calculated as $||\boldsymbol{V}||_0=3$.
  • Figure 4: Example of the align loss term computation for an unlabeled sample.
  • Figure 5: An illustration of FAmS.
  • ...and 13 more figures
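
The worked example in the Figure 3 caption can be reproduced with a short sketch. It assumes, based only on that caption, that $\boldsymbol{V}$ marks each epoch whose prediction differs from the previous epoch's prediction (with the first entry fixed at 0), and that EV is the number of nonzero entries of $\boldsymbol{V}$. The function names `flip_indicator` and `epistemic_variation` are hypothetical and are not taken from the paper.

```python
from typing import Hashable, List, Sequence


def flip_indicator(history: Sequence[Hashable]) -> List[int]:
    """Build V, where V[t] = 1 if the prediction at epoch t differs from epoch t-1.

    Assumption: the first entry is 0, since there is no earlier prediction
    to compare against (this matches the caption's V = [0, 1, 0, 1, 1]).
    """
    if not history:
        return []
    return [0] + [int(cur != prev) for prev, cur in zip(history, history[1:])]


def epistemic_variation(history: Sequence[Hashable]) -> int:
    """Return the L0 norm of V, i.e. the count of its nonzero entries."""
    return sum(1 for v in flip_indicator(history) if v != 0)


# Historical inference results from the Figure 3 caption.
history = ["dog", "cat", "cat", "zebra", "cat"]
print(flip_indicator(history))       # [0, 1, 0, 1, 1]
print(epistemic_variation(history))  # 3
```

Under this reading, a sample whose predicted label never changes has an EV of 0, while a sample that keeps flipping between classes accumulates a high EV and would be a stronger candidate for annotation.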