FedCal: Achieving Local and Global Calibration in Federated Learning via Aggregated Parameterized Scaler

Hongyi Peng, Han Yu, Xiaoli Tang, Xiaoxiao Li

TL;DR

This paper addresses calibration reliability in federated learning (FL) under non-IID data. It proposes FedCal, which learns client-specific post-hoc scalers for local calibration and aggregates them into a global scaler, improving global calibration without requiring global validation data. FedCal uses an MLP-based scaler with order-preserving properties and aggregates the client scalers by weight matching, which relies on linear mode connectivity; the scalers are synchronized periodically alongside FedAvg. Across four datasets and varying non-IID levels, the approach reduces global calibration error by up to roughly 63% relative to unsafeguarded baselines and by about 48% relative to non-ensemble calibration baselines, while maintaining or improving accuracy. The results indicate that coordinating local calibration with an aggregatable global calibrator can significantly improve reliability in FL, with practical implications for high-stakes deployments and potential extensions using privacy-preserving analytics.

Abstract

Federated learning (FL) enables collaborative machine learning across distributed data owners, but data heterogeneity poses a challenge for model calibration. While prior work focused on improving accuracy for non-IID data, calibration remains under-explored. This study reveals that existing FL aggregation approaches lead to sub-optimal calibration, and theoretical analysis shows that, despite constraining the variance in clients' label distributions, the global calibration error is still asymptotically lower bounded. To address this, we propose a novel Federated Calibration (FedCal) approach, emphasizing both local and global calibration. It leverages client-specific scalers for local calibration to effectively correct output misalignment without sacrificing prediction accuracy. These scalers are then aggregated via weight averaging to generate a global scaler, minimizing the global calibration error. Extensive experiments demonstrate that FedCal significantly outperforms the best-performing baseline, reducing global calibration error by 47.66% on average.
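
The aggregation step described in the abstract (client scalers combined by weight averaging) can be illustrated with a minimal sketch. This is not the authors' implementation: the flat parameter layout, the function name `average_scalers`, and the choice to weight clients by local sample count (as FedAvg does) are assumptions made for illustration. The weight-matching alignment mentioned in the TL;DR is omitted, so the sketch assumes the client scalers are already aligned.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: each scaler is a dict mapping a parameter name
# to a flat list of floats. The paper's actual scaler is an MLP.
Params = Dict[str, List[float]]


def average_scalers(client_params: Sequence[Params],
                    client_sizes: Sequence[int]) -> Params:
    """Return the sample-size-weighted average of client scaler parameters.

    Assumption: clients are weighted by their number of local samples,
    in the style of FedAvg. No weight matching is performed here.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = float(sum(client_sizes))
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Params = {}
    for key, reference in client_params[0].items():
        acc = [0.0] * len(reference)
        for params, size in zip(client_params, client_sizes):
            values = params[key]
            if len(values) != len(reference):
                raise ValueError(f"shape mismatch for parameter {key!r}")
            weight = size / total
            for i, value in enumerate(values):
                acc[i] += weight * value
        averaged[key] = acc
    return averaged


if __name__ == "__main__":
    client_a = {"w": [0.0, 4.0], "b": [1.0]}
    client_b = {"w": [4.0, 0.0], "b": [1.0]}
    # Client B holds three times as many samples as client A.
    print(average_scalers([client_a, client_b], [1, 3]))
```

Plain averaging of this kind is only meaningful when the client parameters live in a compatible basis, which is why the TL;DR pairs aggregation with weight matching.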

Paper Structure

This paper contains 24 sections, 2 theorems, 31 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Theorem 4.4

(Lower bound of global calibration error). Consider the scenario where the discrepancy between the local and global label distributions is bounded by $G$, as stated in Assumption ass:distribution_divergence. Let $R$ represent the number of FL communication rounds involving more than two clients. Joi

Figures (5)

  • Figure 1: [Left] Impacts of Data Distribution Discrepancies on Model Calibration in Federated Learning. The presence of non-IID data across local nodes contributes to miscalibration issues in the aggregated model, influencing both local and global datasets. [Right] Impact of data heterogeneity on the accuracy and reliability of FL models. As the degree of non-IIDness (quantified by the Dirichlet distribution parameter $\beta$) increases, both accuracy and reliability of FL models trained on MNIST (MLP, 10 clients) and CIFAR-10 (ResNet-14, 10 clients) using FedAvg (McMahan et al., 2017) deteriorate.
  • Figure 2: Impact of Non-IID Data Distribution on Client and Server Calibration in FL. The top plot shows the calibration error of five clients trained on the MNIST dataset with a Multilayer Perceptron (MLP) model under IID and non-IID distributions. We observe that the non-IID client exhibits significantly higher calibration error compared to the IID clients, and that calibration error can vary significantly across clients and even at the server due to the skewed data distribution. The two bottom plots depict the model reliability for a single client trained under IID and non-IID conditions. The purple dashed line represents the normalized class density of a client, and the gray dashed line represents perfect calibration (i.e., confidence aligns exactly with accuracy). In the IID case (left plot), the model is well-calibrated, with the confidence closely matching the accuracy throughout the range. In contrast, the non-IID case (right plot) reveals severe under-confidence.
  • Figure 3: Impact of the Order-Preserving Network. Without order preservation, $\alpha$ can represent arbitrary mappings, potentially altering the predicted class ordering (highlighted in red). The working principles of order-preserving networks are given in the paper's appendix (label: example_op).
  • Figure 4: Local and global calibration errors as non-IIDness increases. [Top left]: the average local calibration errors. [Bottom left]: the maximum local calibration errors. [Right]: the global calibration error.
  • Figure 5: Local and global ECE vs. aggregation weights.
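
The calibration errors reported in the figures above are expressed as ECE (expected calibration error). As a reminder of what that quantity measures, here is a minimal sketch of the standard binned estimator: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is averaged with weights proportional to bin size. The number of bins, the equal-width binning, and the function name are assumptions; this page does not state the exact variant used in the paper.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin fraction) * |accuracy - confidence|.

    Assumption: equal-width bins [k/n_bins, (k+1)/n_bins), with a
    confidence of exactly 1.0 placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if hit else 0.0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

In a federated setting, the same estimator can be evaluated on each client's local data (local calibration error) or on the pooled data of all clients (global calibration error).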

Theorems & Definitions (7)

  • Definition 3.1
  • Definition 3.2
  • Definition 4.1
  • Definition 4.2
  • Theorem 4.4
  • Proof
  • Theorem 1.1