Holistic Evaluation Metrics: Use Case Sensitive Evaluation Metrics for Federated Learning
Yanli Li, Jehad Ibrahim, Huaming Chen, Dong Yuan, Kim-Kwang Raymond Choo
TL;DR
This work addresses the problem of evaluating federated learning (FL) algorithms beyond a single metric by introducing Holistic Evaluation Metrics (HEM), which combine accuracy, convergence, computational efficiency, fairness, and personalization into a use-case-specific index. It defines importance vectors for three representative use cases (IoT, smart devices, and institutions) and demonstrates how the HEM index differentiates FL algorithms and personalized FL (PFL) methods, such as MAML and ProtoNet, across these contexts. The study runs experiments on CIFAR-10 with 100 clients, analyzing FedAvg, FedDyn, SCAFFOLD, and their personalized variants, and shows that HEM can reveal trade-offs and guide algorithm selection aligned with real-world requirements. Overall, the framework offers a practical, extensible approach to benchmarking FL in deployed settings and suggests avenues for refining use-case-specific importance vectors and benchmarking practices.
Abstract
A large number of federated learning (FL) algorithms have been proposed for different applications and from varying perspectives. However, the evaluation of such approaches often relies on a single metric (e.g., accuracy). Such a practice fails to account for the diverse requirements of different use cases. Thus, how to comprehensively evaluate an FL algorithm and determine the most suitable candidate for a designated use case remains an open question. To address this research gap, we introduce the Holistic Evaluation Metrics (HEM) for FL in this work. Specifically, we focus on three primary use cases: the Internet of Things (IoT), smart devices, and institutions. The evaluation metrics cover accuracy, convergence, computational efficiency, fairness, and personalization. We then assign an importance vector to each use case, reflecting its distinct performance requirements and priorities. The HEM index is finally generated by integrating these metric components with their respective importance vectors. By evaluating different FL algorithms in these three prevalent use cases, our experimental results demonstrate that HEM can effectively assess and identify the FL algorithms best suited to particular scenarios. We anticipate that this work will shed light on the evaluation process for pragmatic FL algorithms in real-world applications.
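To make the idea of combining metric components with an importance vector concrete, the following is a minimal sketch in Python. The abstract does not state the exact aggregation rule, so this sketch assumes an importance-weighted mean of component scores that have already been normalized to [0, 1]. The function name `hem_index`, the two importance vectors, and all scores for `algo_a` and `algo_b` are hypothetical placeholders, not values or results from the paper.

```python
from typing import Mapping

# The five metric components named in the abstract.
COMPONENTS = ("accuracy", "convergence", "efficiency", "fairness", "personalization")


def hem_index(scores: Mapping[str, float], importance: Mapping[str, float]) -> float:
    """Assumed aggregation: importance-weighted mean of component scores.

    Each score is expected in [0, 1], with higher meaning better. The weights
    are rescaled to sum to 1, so the index also lies in [0, 1].
    """
    total = sum(importance[c] for c in COMPONENTS)
    if total <= 0:
        raise ValueError("importance weights must have a positive sum")
    for c in COMPONENTS:
        if not 0.0 <= scores[c] <= 1.0:
            raise ValueError(f"score for {c!r} must lie in [0, 1]")
    return sum(importance[c] * scores[c] for c in COMPONENTS) / total


# Hypothetical importance vectors (illustration only, not from the paper).
accuracy_first = {"accuracy": 5, "convergence": 1, "efficiency": 1,
                  "fairness": 1, "personalization": 1}
efficiency_first = {"accuracy": 1, "convergence": 1, "efficiency": 5,
                    "fairness": 1, "personalization": 1}

# Hypothetical normalized scores for two made-up algorithms.
candidates = {
    "algo_a": {"accuracy": 0.9, "convergence": 0.2, "efficiency": 0.2,
               "fairness": 0.5, "personalization": 0.5},
    "algo_b": {"accuracy": 0.6, "convergence": 0.8, "efficiency": 0.9,
               "fairness": 0.5, "personalization": 0.5},
}


def best_candidate(importance: Mapping[str, float]) -> str:
    """Return the name of the candidate with the highest index."""
    return max(candidates, key=lambda name: hem_index(candidates[name], importance))


for label, weights in (("accuracy_first", accuracy_first),
                       ("efficiency_first", efficiency_first)):
    print(label, "->", best_candidate(weights))
```

Under these made-up numbers, the preferred algorithm changes when the importance vector changes, which is the behavior a use-case-sensitive index is meant to expose. A weighted mean is only one possible way to integrate the components; other aggregation rules would fit the description in the abstract equally well.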
