Private Federated Multiclass Post-hoc Calibration

Samuel Maddock, Graham Cormode, Carsten Maple

TL;DR

This work tackles post-hoc multiclass calibration in Federated Learning under data heterogeneity and privacy constraints. It introduces two main families of calibrators, FedBBQ (multiclass histogram binning with BBQ) and FedTemp (temperature scaling), along with heterogeneity-aware and order-preserving enhancements, and extends them to user-level DP-FL. Through extensive experiments on seven datasets, the authors demonstrate that binning-based methods with weighting excel in standard FL, while temperature scaling provides the best balance under DP with strong privacy guarantees. The results yield practical, dataset-robust recommendations for calibrating federated models and highlight the trade-offs between calibration accuracy, global model performance, and privacy budgets.

Abstract

Calibrating machine learning models so that predicted probabilities better reflect the true outcome frequencies is crucial for reliable decision-making across many applications. In Federated Learning (FL), the goal is to train a global model on data that is distributed across multiple clients and cannot be centralized due to privacy concerns. FL is applied in key areas such as healthcare and finance where calibration is strongly required, yet federated private calibration has been largely overlooked. This work introduces the integration of post-hoc model calibration techniques within FL. Specifically, we transfer traditional centralized calibration methods such as histogram binning and temperature scaling into federated environments and define new methods to operate them under strong client heterogeneity. We study (1) a federated setting and (2) a user-level Differential Privacy (DP) setting and demonstrate how both federation and DP impact calibration accuracy. We propose strategies to mitigate the degradation commonly observed under heterogeneity. Our findings highlight that our federated temperature scaling works best for DP-FL, whereas our weighted binning approach is best when DP is not required.

Paper Structure

This paper contains 50 sections, 3 theorems, 11 equations, 11 figures, 7 tables, 2 algorithms.

Key Result

Lemma B.3

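If a mechanism $\mathcal{M}$ satisfies $\rho$-zCDP then it satisfies $(\varepsilon,\delta)$-DP for all $\varepsilon > 0$ with

$$\delta = \inf_{\alpha > 1} \frac{\exp\left((\alpha-1)(\alpha\rho-\varepsilon)\right)}{\alpha-1}\left(1-\frac{1}{\alpha}\right)^{\alpha}.$$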
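
To make the conversion in Lemma B.3 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the lemma refers to the bound of Canonne et al. (2020), $\delta = \inf_{\alpha > 1} \frac{\exp((\alpha-1)(\alpha\rho-\varepsilon))}{\alpha-1}\left(1-\frac{1}{\alpha}\right)^{\alpha}$, and approximates the infimum on a fixed grid of $\alpha$ values, so the returned $\delta$ is valid but may be slightly loose. All function names are hypothetical. `eps_classic` uses the simpler closed form $\varepsilon = \rho + 2\sqrt{\rho\ln(1/\delta)}$ of Bun and Steinke (2016) for comparison, and `gaussian_rho` uses the standard fact that the Gaussian mechanism with $L_2$ sensitivity $\Delta$ and noise scale $\sigma$ satisfies $\rho$-zCDP with $\rho = \Delta^2 / (2\sigma^2)$.

```python
import math


def gaussian_rho(sensitivity: float, sigma: float) -> float:
    """zCDP parameter of the Gaussian mechanism: rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def zcdp_to_delta(rho: float, eps: float) -> float:
    """Upper bound on delta for a rho-zCDP mechanism at privacy level eps.

    Assumed bound (Canonne et al., 2020):
        delta(alpha) = exp((alpha-1)(alpha*rho - eps)) / (alpha-1) * (1 - 1/alpha)^alpha
    Every alpha > 1 gives a valid delta, so the minimum over a grid is valid too.
    """
    if rho < 0 or eps < 0:
        raise ValueError("rho and eps must be non-negative")
    best = 1.0  # delta never needs to exceed 1
    for k in range(-200, 201):
        alpha = 1.0 + 10.0 ** (k / 50.0)  # alpha - 1 ranges from 1e-4 to 1e4
        # Work in log space to avoid overflow for large alpha.
        log_delta = ((alpha - 1.0) * (alpha * rho - eps)
                     - math.log(alpha - 1.0)
                     + alpha * math.log1p(-1.0 / alpha))
        if log_delta < 0.0:
            best = min(best, math.exp(log_delta))
    return best


def eps_classic(rho: float, delta: float) -> float:
    """Closed-form conversion (Bun and Steinke, 2016): eps = rho + 2 sqrt(rho ln(1/delta))."""
    if rho < 0 or not 0.0 < delta < 1.0:
        raise ValueError("need rho >= 0 and 0 < delta < 1")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def zcdp_to_eps(rho: float, delta: float) -> float:
    """Smallest eps (up to bisection tolerance) with zcdp_to_delta(rho, eps) <= delta."""
    if rho <= 0:
        raise ValueError("rho must be positive")
    lo, hi = 0.0, eps_classic(rho, delta)
    while zcdp_to_delta(rho, hi) > delta:  # safety net; rarely triggered
        hi *= 2.0
    for _ in range(60):  # delta is non-increasing in eps, so bisection applies
        mid = 0.5 * (lo + hi)
        if zcdp_to_delta(rho, mid) <= delta:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    rho = gaussian_rho(sensitivity=1.0, sigma=2.0)
    print("rho:", rho)
    print("eps (grid bound):", zcdp_to_eps(rho, 1e-5))
    print("eps (closed form):", eps_classic(rho, 1e-5))
```

The grid search trades a little tightness for simplicity; a one-dimensional numerical optimizer over $\alpha$ could replace it if a tighter $\delta$ is needed.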

Figures (11)

  • Figure 1: Naive federated calibration on CIFAR10 (Simple CNN), varying heterogeneity $\beta$.
  • Figure 2: FL Calibration on CIFAR10 (Simple CNN), $\beta=0.1$ unless otherwise stated.
  • Figure 3: DP-FL Calibration on CIFAR10 (Simple CNN), $\beta=0.1$, varying $\varepsilon$ with $\delta=10^{-5}$.
  • Figure 4: Federated Calibration via FedBBQ on CIFAR100 whilst varying the bin parameter $M$, which sets the number of bins in the BBQ histogram to $B = 2^M$.
  • Figure 5: FL Calibration on CIFAR100 (Simple CNN), $\beta=0.1$ unless otherwise stated.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Definition 3.1: $(\varepsilon, \delta)$-DP
  • Definition 3.2: Perfect Calibration
  • Definition 3.3: Classwise-ECE
  • Definition B.1: Differential Privacy (Dwork and Roth, 2014)
  • Definition B.2: $\rho$-zCDP
  • Lemma B.3: zCDP to DP (Canonne et al., 2020)
  • Definition B.4: $L_2$ Sensitivity
  • Definition B.5: Gaussian Mechanism, GM
  • Lemma B.6: zCDP composition (Bun and Steinke, 2016)
  • Lemma B.7: Noise calibration for $(\varepsilon, \delta)$-DP federated calibration
  • ...and 2 more