Rashomon Sets and Model Multiplicity in Federated Learning

Xenia Heilmann, Luca Corbucci, Mattia Cerrato

TL;DR

This paper tackles the problem of model multiplicity in Federated Learning by formalizing Rashomon sets in a decentralized setting. It introduces three FL-specific Rashomon definitions—global, $t$-agreement, and individual—and develops predictive multiplicity metrics compatible with FL’s privacy constraints, including score-based and decision-based measures. An end-to-end multiplicity-aware FL pipeline is proposed, and an empirical study on Dutch Census, ACS Income, and MNIST demonstrates how these definitions reveal client-specific differences in performance, fairness, and robustness. The findings show that relying on a single global model can mask meaningful heterogeneity across clients, motivating personalized model selection and fairness considerations in FL.

Abstract

The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision-boundary instabilities that standard metrics obscure. However, existing definitions of the Rashomon set and of multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a $t$-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction $t$ of clients, and (III) individual Rashomon sets specific to each client's local distribution. Second, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.
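The three perspectives in the abstract can be made concrete with a small sketch. It is not the paper's implementation, and it rests on several assumptions: a model counts as "near-optimal" when its loss is within an additive tolerance `epsilon` of the best candidate, the global view uses a sample-size-weighted average of client losses, and "agreement across a fraction $t$ of clients" is read as membership in at least $\lceil t \cdot K \rceil$ of the $K$ local sets. All function names and the toy numbers are hypothetical.

```python
from math import ceil

def individual_rashomon_sets(losses, epsilon):
    """One set per client. losses[k][m] is the loss of candidate model m
    on client k's local data (assumed to be computed locally)."""
    sets = []
    for client_losses in losses:
        best = min(client_losses)
        sets.append({m for m, loss in enumerate(client_losses)
                     if loss <= best + epsilon})
    return sets

def global_rashomon_set(losses, n_samples, epsilon):
    """Models near-optimal on the sample-size-weighted average loss
    (assumed aggregation; only per-client losses and counts are needed)."""
    total = sum(n_samples)
    n_models = len(losses[0])
    aggregated = [
        sum(n * client[m] for n, client in zip(n_samples, losses)) / total
        for m in range(n_models)
    ]
    best = min(aggregated)
    return {m for m, loss in enumerate(aggregated) if loss <= best + epsilon}

def t_agreement_rashomon_set(losses, epsilon, t):
    """Models that lie in the local Rashomon set of at least a fraction t
    of clients (assumed reading of the t-agreement definition)."""
    local_sets = individual_rashomon_sets(losses, epsilon)
    # Small slack guards against float round-up, e.g. 0.7 * 10 > 7.
    needed = ceil(t * len(local_sets) - 1e-9)
    n_models = len(losses[0])
    return {m for m in range(n_models)
            if sum(m in s for s in local_sets) >= needed}

# Toy example: 3 clients, 4 candidate models (made-up losses).
losses = [
    [0.10, 0.12, 0.30, 0.11],
    [0.20, 0.35, 0.21, 0.22],
    [0.15, 0.16, 0.40, 0.14],
]
n_samples = [100, 100, 200]
epsilon = 0.03

local_sets = individual_rashomon_sets(losses, epsilon)
global_set = global_rashomon_set(losses, n_samples, epsilon)
strict_set = t_agreement_rashomon_set(losses, epsilon, t=1.0)
relaxed_set = t_agreement_rashomon_set(losses, epsilon, t=0.6)
```

In this toy setup model 2 is near-optimal only for the second client, so it appears in that client's individual set but in neither the global set nor any of the agreement sets computed here. This is the kind of client-specific difference that a single global model would hide.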

Paper Structure (31 sections, 16 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: A paradigm shift needed in FL: the old "single-best" model hides significant differences in behavior across clients, obscuring perspectives that vary for different subsets of data. Exploring the Rashomon set enables a more accountable and transparent understanding of model performance.
  • Figure 2: Complete FL pipeline for integration of Rashomon sets and multiplicity analysis
  • Figure 3: Comparison of multiplicity metrics on Rashomon sets defined using the $t$-agreement and global definitions, with centralized evaluation as the baseline, for Dutch, ACS Income, and MNIST. The global definition yields consistently higher multiplicity, $t$-agreement sets are smaller, and centralized evaluation shows higher discrepancy. On MNIST, Rashomon ratios and multiplicity metrics are higher due to the complexity of the problem. Disagreement is only calculated for binary outcomes.
  • Figure 4: Comparison of multiplicity metrics for individual Rashomon sets (10 clients), with the blue shaded area showing the min-max range from the global-definition figure (LaTeX label `fig:global`). Disagreement is only defined for binary outputs. Clients frequently deviate from the global range, indicating that an individual Rashomon set definition is essential for capturing local differences.
  • Figure 5: Multiplicity metrics for global and $t$-agreement Rashomon sets on the Dutch dataset when varying the number of FL clients. The metrics remain consistent across different clients, indicating that the analysis scales. Stricter $t$-agreement thresholds (e.g., $0.9$) fail to produce Rashomon sets for 10, 40, or 50 clients; tight constraints can limit the feasibility in extreme client configurations.
  • ...and 6 more figures
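
The captions above compare multiplicity metrics such as discrepancy across Rashomon set definitions. As a rough illustration of how a decision-based metric might be estimated without pooling raw data, the sketch below uses ambiguity and discrepancy as they are commonly defined in the centralized multiplicity literature. It assumes each client shares only aggregate counts with the server. This aggregation scheme, the function names, and the toy predictions are assumptions and not the paper's protocol, and sharing counts alone carries no formal privacy guarantee.

```python
def client_counts(preds, baseline):
    """Run locally on one client.
    preds[m][i]: label from Rashomon-set model m on local sample i.
    baseline[i]: label from the reference model on local sample i.
    Returns only aggregate counts, never per-sample data."""
    n = len(baseline)
    # Per model: number of samples whose label differs from the baseline.
    flips = [sum(p != b for p, b in zip(model_preds, baseline))
             for model_preds in preds]
    # Samples on which at least one model disagrees with the baseline.
    ambiguous = sum(
        any(model_preds[i] != baseline[i] for model_preds in preds)
        for i in range(n)
    )
    return {"n": n, "ambiguous": ambiguous, "flips": flips}

def aggregate(counts):
    """Run on the server over the counts reported by all clients."""
    total = sum(c["n"] for c in counts)
    ambiguity = sum(c["ambiguous"] for c in counts) / total
    n_models = len(counts[0]["flips"])
    # Discrepancy: worst-case flip rate over the models in the set.
    discrepancy = max(
        sum(c["flips"][m] for c in counts) / total
        for m in range(n_models)
    )
    return ambiguity, discrepancy

# Toy example: two clients, two models in the Rashomon set (made-up labels).
client_a = client_counts(
    preds=[[0, 1, 0, 0], [1, 1, 1, 0]],
    baseline=[0, 1, 1, 0],
)
client_b = client_counts(
    preds=[[1, 0, 0, 1, 1, 0], [1, 1, 0, 1, 0, 0]],
    baseline=[1, 0, 0, 1, 1, 0],
)
ambiguity, discrepancy = aggregate([client_a, client_b])
```

Because both metrics are ratios of counts, summing the client-level counts gives the same values as computing them on the pooled data. A score-based metric would need a different aggregation and is not covered by this sketch.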

Theorems & Definitions (6)

  • Definition 1: Rashomon Set [ganesh2025systemizingmultiplicitycuriouscase]
  • Definition 2: Model Multiplicity [ganesh2025systemizingmultiplicitycuriouscase]
  • Definition 3: Global Rashomon set
  • Definition 4: $t$-agreement Rashomon set
  • Definition 5: Individual Rashomon set
  • Definition 6