On the Distributed Evaluation of Generative Models

Zixiao Wang; Farzan Farnia; Zhenghao Lin; Yunheng Shen; Bei Yu

TL;DR

The paper addresses evaluating generative models in distributed, non-i.i.d. settings by comparing two aggregation schemes for distance-based metrics: $\mathcal{D}_{\mathrm{all}}$ (centralized reference) and $\mathcal{D}_{\mathrm{avg}}$ (client-averaged). It proves that $\mathrm{KID}_{\mathrm{avg}}$ preserves the ranking of $\mathrm{KID}_{\mathrm{all}}$, making it a robust, communication-efficient proxy for distributed evaluation, while $\mathrm{FID}_{\mathrm{avg}}$ need not preserve the ranking of $\mathrm{FID}_{\mathrm{all}}$ and can misrank models under data heterogeneity. Through extensive experiments on FFHQ, CIFAR, AFHQ, and ImageNet-32, the authors demonstrate consistent KID-based rankings and illustrate scenarios where FID-based rankings diverge, including mode-collapse detection. The work suggests adopting $\mathrm{KID}_{\mathrm{avg}}$ as a privacy-preserving, scalable tool for distributed model evaluation and highlights avenues for extending analysis to other metrics and domains.

Abstract

The evaluation of deep generative models has been extensively studied in the centralized setting, where the reference data are drawn from a single probability distribution. However, several applications of generative models concern distributed settings, e.g., federated learning, where the reference data for evaluation are provided by several clients in a network. In this paper, we study the evaluation of generative models in such distributed contexts with potentially heterogeneous data distributions across clients. We focus on two widely used distance-based evaluation metrics, Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). For the KID metric, we prove that scoring a group of generative models by the clients' averaged KID score yields the same ranking as a centralized KID evaluation over a collective reference set containing all the clients' data. In contrast, we show that the same result does not hold for FID-based evaluation. We provide examples in which two generative models are assigned the same FID score by each client in a distributed setting, while the centralized FID scores of the two models differ significantly. We perform several numerical experiments on standard image datasets and generative models to support our theoretical results on the distributed evaluation of generative models using FID and KID scores.
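To make the FID claim concrete, here is a minimal toy sketch, not the paper's own construction. It assumes one-dimensional Gaussian "features" (real FID is computed on Inception embeddings), two hypothetical clients with equal weight, and two hypothetical generators `G_A` and `G_B`. In one dimension the Fréchet distance between Gaussians reduces to $(\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2$, and the centralized score fits a single Gaussian to the pooled mixture of client data.

```python
import math

def fid_1d(mu_a, sigma_a, mu_b, sigma_b):
    """Frechet distance between the 1-D Gaussians N(mu_a, sigma_a^2) and N(mu_b, sigma_b^2)."""
    return (mu_a - mu_b) ** 2 + (sigma_a - sigma_b) ** 2

# Hypothetical heterogeneous clients, given as (mean, std); equal frequency weights.
clients = [(-math.sqrt(3.0), 1.0), (math.sqrt(3.0), 1.0)]
weights = [0.5, 0.5]

# Mean and std of the pooled mixture, which is what a centralized FID would fit.
pooled_mu = sum(w * mu for w, (mu, _) in zip(weights, clients))
pooled_second_moment = sum(w * (s ** 2 + mu ** 2) for w, (mu, s) in zip(weights, clients))
pooled_sigma = math.sqrt(pooled_second_moment - pooled_mu ** 2)

# Hypothetical generators, given as (mean, std).
generators = {"G_A": (0.0, 1.5), "G_B": (0.0, 0.5)}

scores = {}
for name, (mu_g, sigma_g) in generators.items():
    per_client = [fid_1d(mu, s, mu_g, sigma_g) for mu, s in clients]
    fid_avg = sum(w * f for w, f in zip(weights, per_client))
    fid_all = fid_1d(pooled_mu, pooled_sigma, mu_g, sigma_g)
    scores[name] = {"per_client": per_client, "fid_avg": fid_avg, "fid_all": fid_all}
    print(name, per_client, fid_avg, fid_all)
```

In this toy setup every client assigns both generators the same score (about 3.25), so the averaged score cannot separate them, while the centralized score is about 0.25 for `G_A` and about 2.25 for `G_B`.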

Paper Structure (33 sections, 3 theorems, 13 equations, 22 figures, 2 tables)

Introduction
Related Work
Preliminaries
Evaluation of Generative Models in Distributed Contexts
KID-based Evaluation in Distributed Settings
FID-based Evaluation in Distributed Settings
Numerical Results
Distributed vs. Centralized Evaluation using the FID metric
Comparison between Rankings implied by FID-based and KID-based Distributed Scores
Mode Collapse Detection in Distributed Settings via KID-avg
Conclusion
Proofs
Proof of Theorem 1
Proof of Theorem 2
Proof of Proposition 1
...and 18 more sections

Key Result

Theorem 1

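Consider a kernel function $k:\mathbb{R}^d\times \mathbb{R}^d\rightarrow \mathbb{R}$ and the resulting KID score. Then for the clients' distributions $P_{X_1},\ldots , P_{X_k}$ with frequency parameters $\lambda_1,\ldots , \lambda_k$, we will have the following for the average distribution $\widehat{P}_X = \sum_{i=1}^k \lambda_i P_{X_i}$:

$$\mathrm{KID}_{\mathrm{avg}}(P_{X_1},\ldots,P_{X_k};P_G) = \mathrm{KID}_{\mathrm{all}}(P_{X_1},\ldots,P_{X_k};P_G) + \sum_{i=1}^k \lambda_i\, \mathrm{KID}(P_{X_i}, \widehat{P}_X),$$

which implies a monotonic relationship between $\mathrm{KID}_{\mathrm{all}}$ and $\mathrm{KID}_{\mathrm{avg}}$ as a function of $P_G$.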
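The monotonic relationship in Theorem 1 can be checked numerically. The sketch below assumes KID is the squared maximum mean discrepancy (MMD), estimated with the biased (V-statistic) estimator, and uses the cubic polynomial kernel commonly chosen for KID. Under those assumptions, and with frequency weights proportional to client sample sizes, the client-averaged score equals the pooled score plus a heterogeneity term that does not depend on the generator. The function names, the synthetic Gaussian data, and the feature dimension are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def poly_kernel(a, b, degree=3, coef0=1.0):
    """Polynomial kernel (<x, y> / d + coef0) ** degree on row-wise samples."""
    d = a.shape[1]
    return (a @ b.T / d + coef0) ** degree

def mmd2_biased(x, y, kernel=poly_kernel):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def kid_avg_and_all(clients, generated):
    """Return (KID-avg, KID-all, heterogeneity) for a list of client sample arrays."""
    n_total = sum(len(c) for c in clients)
    lam = [len(c) / n_total for c in clients]   # frequency weights lambda_i
    pooled = np.concatenate(clients, axis=0)    # collective reference set
    kid_all = mmd2_biased(pooled, generated)
    kid_avg = sum(l * mmd2_biased(c, generated) for l, c in zip(lam, clients))
    # Generator-independent term: weighted distance of each client to the pool.
    heterogeneity = sum(l * mmd2_biased(c, pooled) for l, c in zip(lam, clients))
    return kid_avg, kid_all, heterogeneity

rng = np.random.default_rng(0)
# Synthetic heterogeneous clients: different means and sample sizes.
clients = [rng.normal(loc=mu, size=(n, 4)) for mu, n in [(-1.0, 50), (0.5, 80), (2.0, 30)]]
# Synthetic "generators" of varying quality.
generators = [rng.normal(loc=mu, scale=s, size=(100, 4))
              for mu, s in [(0.0, 1.0), (0.3, 1.5), (1.0, 0.5)]]

results = [kid_avg_and_all(clients, g) for g in generators]
for kid_avg, kid_all, het in results:
    print(f"avg={kid_avg:.4f}  all={kid_all:.4f}  all+het={kid_all + het:.4f}")
```

Because the heterogeneity term is the same for every generator, sorting the generators by the averaged score or by the pooled score gives the same order. With the unbiased KID estimator used in practice, the identity would hold only approximately on finite samples.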

Figures (22)

Figure 1: Centralized (score-all) vs. distributed (score-avg) evaluation of generative models. The centralized and distributed FID-based evaluations can lead to different rankings, while we prove the KID-based evaluations result in the same ranking of a group of generative models.
Figure 2: Visual examples of generators and clients in Table \ref{tab:same_client_fid}.
Figure 3: Illustration of generators with different truncation factors. $\tau$ is the truncation parameter.
Figure 4: Illustration of simulated clients with heterogeneous distributions via truncation technique.
Figure 5: The results of evaluating the generators in Figure 3 over the clients in Figure 4.
...and 17 more figures

Theorems & Definitions (4)

Theorem 1
Theorem 2
Remark 1
Proposition 1
