Private Federated Multiclass Post-hoc Calibration
Samuel Maddock, Graham Cormode, Carsten Maple
TL;DR
This work addresses post-hoc multiclass calibration in Federated Learning (FL) under data heterogeneity and privacy constraints. It introduces two families of calibrators, FedBBQ (multiclass histogram binning with BBQ) and FedTemp (temperature scaling), together with heterogeneity-aware and order-preserving enhancements, and extends them to user-level DP-FL. In experiments on seven datasets, the authors find that weighted binning-based methods perform best in standard FL, while temperature scaling offers the best balance under DP with strong privacy guarantees. The results give practical recommendations for calibrating federated models that hold across datasets, and they highlight the trade-offs between calibration accuracy, global model performance, and privacy budget.
Abstract
Calibrating machine learning models so that predicted probabilities better reflect the true outcome frequencies is crucial for reliable decision-making across many applications. In Federated Learning (FL), the goal is to train a global model on data that is distributed across multiple clients and cannot be centralized due to privacy concerns. FL is applied in key areas such as healthcare and finance where calibration is strongly required, yet federated private calibration has been largely overlooked. This work introduces the integration of post-hoc model calibration techniques within FL. Specifically, we transfer traditional centralized calibration methods such as histogram binning and temperature scaling into federated environments and define new methods to operate them under strong client heterogeneity. We study (1) a federated setting and (2) a user-level Differential Privacy (DP) setting and demonstrate how both federation and DP affect calibration accuracy. We propose strategies to mitigate the degradation commonly observed under heterogeneity. Our findings highlight that our federated temperature scaling works best for DP-FL, whereas our weighted binning approach is best when DP is not required.
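To make the idea of moving histogram binning into a federated environment concrete, the following is a minimal sketch and not the paper's FedBBQ method. It assumes equal-width bins over the top-label confidence, that each client shares only per-bin counts, and that the server sums them; the function names `client_bin_counts` and `server_aggregate` are hypothetical. No DP noise, BBQ averaging, or heterogeneity-aware weighting is included.

```python
import numpy as np

def client_bin_counts(confidences, correct, n_bins=10):
    """Per-bin (total, positive) counts computed locally on one client.

    confidences: top-label predicted probabilities in [0, 1] (assumed input).
    correct: 1.0 where the top-label prediction was right, else 0.0.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to an equal-width bin index in [0, n_bins - 1].
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    totals = np.bincount(idx, minlength=n_bins)
    positives = np.bincount(idx, weights=correct, minlength=n_bins)
    return totals, positives

def server_aggregate(client_stats):
    """Sum client counts and return one calibrated probability per bin."""
    totals = sum(t for t, _ in client_stats)
    positives = sum(p for _, p in client_stats)
    n_bins = len(totals)
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # Empirical accuracy per bin; empty bins fall back to the bin center.
    return np.where(totals > 0, positives / np.maximum(totals, 1), centers)

# Usage on synthetic data split across three simulated clients.
rng = np.random.default_rng(0)
conf = rng.uniform(size=300)
corr = (rng.uniform(size=300) < conf).astype(float)
parts = np.array_split(np.arange(300), 3)
stats = [client_bin_counts(conf[p], corr[p]) for p in parts]
bin_values = server_aggregate(stats)
```

Because the counts are additive, summing them on the server reproduces what centralized binning would compute on the pooled data. With few samples per client or with added privacy noise the per-bin estimates become less reliable, which is the kind of degradation the abstract refers to.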
