EvalGIM: A Library for Evaluating Generative Image Models

Melissa Hall, Oscar Mañas, Reyhane Askari-Hemmat, Mark Ibrahim, Candace Ross, Pietro Astolfi, Tariq Berrada Ifriqi, Marton Havasi, Yohann Benchetrit, Karen Ullrich, Carolina Braga, Abhishek Charnalia, Maeve Ryan, Mike Rabbat, Michal Drozdzal, Jakob Verbeek, Adriana Romero-Soriano

TL;DR

EvalGIM introduces a unified, extensible library for evaluating text-to-image generative models across diverse datasets, metrics, and visualizations. It provides modular components for models, datasets, metrics, and visualizations, plus Evaluation Exercises that yield actionable insights on trade-offs, geographic group representation, ranking robustness, and prompting styles. The framework supports plug-and-play additions of datasets and metrics, distributed sweeps, and end-to-end, reproducible analyses, demonstrated through four Evaluation Exercises that reproduce prior methods and extend them with new analyses. By unifying evaluation components and emphasizing actionable takeaways, EvalGIM aims to accelerate robust benchmarking and methodological progress in text-to-image generation research.

Abstract

As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. However, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced "EvalGym"), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce "Evaluation Exercises" that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.

Paper Structure

This paper contains 76 sections, 8 figures.

Figures (8)

  • Figure 1: EvalGIM (pronounced as "EvalGym") is an easy-to-use evaluation library for text-to-image generative models that unifies useful evaluation metrics, datasets, and visualizations, supports flexibility for user needs (and extensibility to future benchmarks), and provides actionable insights into model performance. To enable interpretable benchmarking, EvalGIM contains Evaluation Exercises that highlight takeaways for specific evaluation questions related to performance trade-offs, group representation, model ranking robustness, and prompting styles.
  • Figure 2: EvalGIM contains Evaluation Exercises that allow for structured end-to-end evaluations targeting specific analysis goals. The Exercises can be easily executed in a reproducible manner with user-friendly notebooks.
  • Figure 3: Utilizing the Trade-offs Evaluation Exercise gives insights into the relationship between quality, diversity, and consistency. When applied to preliminary studies of early training of a text-to-image generative model, consistency (as measured by VQAScore) increases then plateaus, while automatic measures of quality and diversity can fluctuate.
  • Figure 4: Finer-detailed image improvements identified via qualitative inspection do not always translate to improvements in automatic measures of quality and diversity.
  • Figure 5: Using the Group Representation Evaluation Exercise provides insights into potential disparities in model performance across groups and whether improvements over successive model generations have occurred similarly across groups. When studying successive versions of a latent diffusion model with increasingly complex training data and fine-tuning methods, we find that advancements correspond to an improvement in quality and diversity more for some geographic regions (e.g. Southeast Asia) than others (e.g. Africa).
  • ...and 3 more figures