Rashomon Sets and Model Multiplicity in Federated Learning
Xenia Heilmann, Luca Corbucci, Mattia Cerrato
TL;DR
This paper tackles the problem of model multiplicity in Federated Learning by formalizing Rashomon sets in a decentralized setting. It introduces three FL-specific Rashomon definitions—global, $t$-agreement, and individual—and develops predictive multiplicity metrics compatible with FL’s privacy constraints, including score-based and decision-based measures. An end-to-end multiplicity-aware FL pipeline is proposed, and an empirical study on Dutch Census, ACS Income, and MNIST demonstrates how these definitions reveal client-specific differences in performance, fairness, and robustness. The findings show that relying on a single global model can mask meaningful heterogeneity across clients, motivating personalized model selection and fairness considerations in FL.
Abstract
The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision-boundary instabilities that standard metrics obscure. However, the existing definitions of the Rashomon set and of multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a $t$-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction $t$ of clients, and (III) individual Rashomon sets specific to each client's local distribution. Second, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.
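To make the three perspectives concrete, the following is a minimal Python sketch, not the paper's formal definitions. It assumes a finite pool of candidate models, a per-client loss for each candidate (only these scalar losses are shared, never raw data), and the common tolerance-based notion of a Rashomon set: all models whose loss is within `epsilon` of the best. The function names, the size-weighted averaging used for the global set, and the reading of $t$-agreement as "inside the local Rashomon set of at least a fraction $t$ of clients" are assumptions made for illustration.

```python
import math


def rashomon_set(losses, epsilon):
    """Indices of models whose loss is within epsilon of the best loss."""
    best = min(losses)
    return {m for m, loss in enumerate(losses) if loss <= best + epsilon}


def global_rashomon(client_losses, client_sizes, epsilon):
    """Rashomon set over the size-weighted average loss across clients (assumed aggregation)."""
    total = sum(client_sizes)
    n_models = len(client_losses[0])
    aggregated = [
        sum(n * row[m] for n, row in zip(client_sizes, client_losses)) / total
        for m in range(n_models)
    ]
    return rashomon_set(aggregated, epsilon)


def individual_rashomon(client_losses, epsilon):
    """One Rashomon set per client, computed from that client's local losses only."""
    return [rashomon_set(row, epsilon) for row in client_losses]


def t_agreement_rashomon(client_losses, epsilon, t):
    """Models that fall in the local Rashomon set of at least ceil(t * K) clients (assumed reading)."""
    local_sets = individual_rashomon(client_losses, epsilon)
    needed = math.ceil(t * len(local_sets))
    n_models = len(client_losses[0])
    return {m for m in range(n_models) if sum(m in s for s in local_sets) >= needed}


# Hypothetical toy data: rows are clients, columns are candidate models.
client_losses = [
    [0.10, 0.12, 0.30, 0.11],
    [0.20, 0.21, 0.19, 0.40],
    [0.15, 0.35, 0.16, 0.17],
]
client_sizes = [100, 100, 200]
epsilon = 0.03

print(global_rashomon(client_losses, client_sizes, epsilon))
print(individual_rashomon(client_losses, epsilon))
print(t_agreement_rashomon(client_losses, epsilon, t=1.0))
print(t_agreement_rashomon(client_losses, epsilon, t=0.5))
```

On this toy data, only model 0 is in the global set, while each client's individual set contains two further models that differ from client to client. This is the kind of client-level heterogeneity that a single global model can hide.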
