FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models

Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, Yu Wang

TL;DR

This work systematically investigates the design choices for selecting evaluation data, including the selection criteria (textual features or image-based metrics) and the selection granularity (prompt-level or set-level), and proposes FlashEval, an iterative search algorithm tailored to evaluation data selection.

Abstract

In recent years, there has been significant progress in the development of text-to-image generative models. Evaluating the quality of these generative models is an essential step in the development process. Unfortunately, evaluation can consume a significant amount of computational resources, making the required periodic evaluation of model performance (e.g., monitoring training progress) impractical. Therefore, we seek to improve evaluation efficiency by selecting a representative subset of the text-image dataset. We systematically investigate the design choices, including the selection criteria (textual features or image-based metrics) and the selection granularity (prompt-level or set-level). We find that the insights from prior work on subset selection for training data do not generalize to this problem, and we propose FlashEval, an iterative search algorithm tailored to evaluation data selection. We demonstrate the effectiveness of FlashEval on ranking diffusion models with various configurations, including architectures, quantization levels, and sampler schedules, on the COCO and DiffusionDB datasets. Our searched 50-item subset achieves evaluation quality comparable to that of a randomly sampled 500-item subset for COCO annotations on unseen models, a 10x evaluation speedup. We release the condensed subsets of these commonly used datasets to help facilitate diffusion algorithm design and evaluation, and we open-source FlashEval as a tool for condensing future datasets, accessible at https://github.com/thu-nics/FlashEval.
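The notion of "evaluation quality" used here is a ranking correlation: how closely the model ranking produced by a small prompt subset agrees with the ranking produced by the full dataset (the figure captions report Kendall's tau). The sketch below illustrates only that measurement, using a randomly sampled subset, and is not the FlashEval search itself. The score table is synthetic, and the names `kendall_tau`, `model_scores`, and `score_table` are assumptions introduced for illustration, not part of the released tool.

```python
import random
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def model_scores(score_table, prompt_ids):
    """Average each model's per-prompt scores over the chosen prompts."""
    ids = list(prompt_ids)
    return [sum(row[p] for p in ids) / len(ids) for row in score_table]


# Synthetic stand-in data (assumption): 8 hypothetical models x 500 prompts.
# Each model has a slightly different mean quality plus per-prompt noise.
rng = random.Random(0)
n_models, n_prompts = 8, 500
score_table = [
    [0.1 * m + rng.gauss(0.0, 1.0) for _ in range(n_prompts)]
    for m in range(n_models)
]

# Reference ranking signal: model scores averaged over every prompt.
full_scores = model_scores(score_table, range(n_prompts))

# Cheap estimate: model scores averaged over a random 50-prompt subset.
subset = rng.sample(range(n_prompts), 50)
subset_scores = model_scores(score_table, subset)

tau = kendall_tau(full_scores, subset_scores)
print(f"Kendall's tau, 50-prompt subset vs. full set: {tau:.3f}")
```

A tau close to 1 means the subset orders the models almost exactly as the full set does. A subset-selection method can be judged by how high it pushes this value at a fixed subset size compared with random sampling.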

Paper Structure (17 sections, 1 equation, 7 figures, 6 tables, 1 algorithm)


Figures (7)

  • Figure 1: The motivation and effectiveness of FlashEval. Left: Applications and the search method. Right: The excessive evaluation cost and how FlashEval improves the evaluation speed-quality trade-off (evaluation quality is represented by the ranking correlation of a variety of diffusion models w.r.t. subset size on COCO).
  • Figure 2: Illustration of why baseline-3 fails at small item sizes $N'$. When multiple prompts that each have a high standalone KD are combined, the set-wise KD is lower.
  • Figure 3: Illustration of the FlashEval search method. Inspired by evolutionary algorithms, we design an iterative search algorithm that operates at both the set level and the prompt level.
  • Figure 4: Comparison with baseline-3 of Kendall’s tau for different metrics on the COCO dataset for the "Random" sub-task. The red, green, and blue dots represent ours with 100 items, B3-set with 200 items, and random sampling with 500 items, respectively. The shaded areas are standard errors. The complete comparison results on the two datasets are in the appendix.
  • Figure 5: The evolution of model scores (sd1.5, DPM solver) with increasing steps.
  • ...and 2 more figures