On the Distributed Evaluation of Generative Models
Zixiao Wang, Farzan Farnia, Zhenghao Lin, Yunheng Shen, Bei Yu
TL;DR
The paper addresses evaluating generative models in distributed, non-i.i.d. settings by comparing two aggregation schemes for distance-based metrics: $\mathcal{D}_{\mathrm{all}}$ (centralized reference) and $\mathcal{D}_{\mathrm{avg}}$ (client-averaged). It proves that $\mathrm{KID}_{\mathrm{avg}}$ preserves the ranking of $\mathrm{KID}_{\mathrm{all}}$, making it a robust, communication-efficient proxy for distributed evaluation, while $\mathrm{FID}_{\mathrm{avg}}$ need not preserve the ranking of $\mathrm{FID}_{\mathrm{all}}$ and can misrank models under data heterogeneity. Through extensive experiments on FFHQ, CIFAR, AFHQ, and ImageNet-32, the authors demonstrate consistent KID-based rankings and illustrate scenarios where FID-based rankings diverge, including mode-collapse detection. The work suggests adopting $\mathrm{KID}_{\mathrm{avg}}$ as a privacy-preserving, scalable tool for distributed model evaluation and highlights avenues for extending analysis to other metrics and domains.
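The ranking claim for KID can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: random Gaussian vectors stand in for Inception features, the cubic polynomial kernel and the biased (V-statistic) MMD estimate are used, and all clients hold the same number of samples. The names `kid_all`, `kid_avg`, `gen_a`, and `gen_b` are hypothetical. Under these assumptions the gap $\mathrm{KID}_{\mathrm{avg}} - \mathrm{KID}_{\mathrm{all}}$ depends only on the clients' data, so it is identical for every model and the two scores rank models the same way.

```python
import numpy as np

def poly_kernel(x, y):
    # Cubic polynomial kernel (x.y / d + 1)^3, the usual choice for KID.
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def mmd2_biased(x, y):
    # Biased (V-statistic) estimate of the squared MMD between samples x and y.
    return (poly_kernel(x, x).mean()
            - 2.0 * poly_kernel(x, y).mean()
            + poly_kernel(y, y).mean())

def kid_all(gen, clients):
    # Centralized score: compare against the pooled reference set.
    return float(mmd2_biased(gen, np.concatenate(clients, axis=0)))

def kid_avg(gen, clients):
    # Distributed score: average of the per-client scores.
    return float(np.mean([mmd2_biased(gen, c) for c in clients]))

rng = np.random.default_rng(0)
d, n = 8, 200  # assumed feature dimension and per-client sample size

# Heterogeneous clients: equal sizes, different means (stand-in features).
clients = [rng.normal(loc=m, scale=1.0, size=(n, d)) for m in (-1.0, 0.0, 2.0)]

# Samples from two hypothetical generators.
gen_a = rng.normal(loc=0.3, scale=1.0, size=(n, d))
gen_b = rng.normal(loc=1.5, scale=0.5, size=(n, d))

# The gap between the two aggregation schemes for each model.
gap_a = kid_avg(gen_a, clients) - kid_all(gen_a, clients)
gap_b = kid_avg(gen_b, clients) - kid_all(gen_b, clients)
print(gap_a, gap_b)  # the two gaps agree up to floating-point error
```

With unequal client sizes, the same argument would presumably call for weighting each client's score by its share of the pooled data; that variant is not shown here.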
Abstract
The evaluation of deep generative models has been extensively studied in the centralized setting, where the reference data are drawn from a single probability distribution. On the other hand, several applications of generative models concern distributed settings, e.g., the federated learning setting, where the reference data for conducting evaluation are provided by several clients in a network. In this paper, we study the evaluation of generative models in such distributed contexts with potentially heterogeneous data distributions across clients. We focus on two widely used distance-based evaluation metrics, Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). For the KID metric, we prove that scoring a group of generative models by the clients' averaged KID score results in the same ranking as a centralized KID evaluation over a collective reference set containing all the clients' data. In contrast, we show that the same result does not hold for FID-based evaluation. We provide examples in which two generative models are assigned the same FID score by each client in a distributed setting, while the centralized FID scores of the two models are significantly different. We perform several numerical experiments on standard image datasets and generative models to support our theoretical results on the distributed evaluation of generative models using FID and KID scores.
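A toy construction shows how per-client FID scores can tie while the centralized scores differ. The numbers below are an assumed one-dimensional illustration, not an example taken from the paper: FID fits a Gaussian to each sample set, and for 1-D Gaussians the Fréchet distance reduces to $(\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2$. The helper `fid_1d` and the client and model parameters are hypothetical.

```python
import math

def fid_1d(mu1, sd1, mu2, sd2):
    # Frechet distance between the 1-D Gaussians N(mu1, sd1^2) and N(mu2, sd2^2).
    return (mu1 - mu2) ** 2 + (sd1 - sd2) ** 2

# Two equally weighted clients, given as (mean, std); assumed values.
clients = [(-4.0, 3.0), (4.0, 3.0)]

# Moments of the pooled reference data (equal-weight mixture of the clients).
pooled_mu = sum(m for m, _ in clients) / len(clients)
pooled_var = sum(s ** 2 + m ** 2 for m, s in clients) / len(clients) - pooled_mu ** 2
pooled_sd = math.sqrt(pooled_var)

# Two hypothetical generative models, given as (mean, std).
models = {"A": (0.0, 5.0), "B": (0.0, 1.0)}

results = {}
for name, (mu, sd) in models.items():
    per_client = [fid_1d(mu, sd, m, s) for m, s in clients]
    fid_avg = sum(per_client) / len(per_client)
    fid_all = fid_1d(mu, sd, pooled_mu, pooled_sd)
    results[name] = (per_client, fid_avg, fid_all)
    print(name, per_client, fid_avg, fid_all)
# A [20.0, 20.0] 20.0 0.0
# B [20.0, 20.0] 20.0 16.0
```

Each client scores both models at 20, so the averaged FID cannot separate them, while the pooled reference clearly prefers model A, whose spread matches the mixture of the two clients.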
