Consistency-diversity-realism Pareto fronts of conditional image generative models
Pietro Astolfi, Marlene Careil, Melissa Hall, Oscar Mañas, Matthew Muckley, Jakob Verbeek, Adriana Romero Soriano, Michal Drozdzal
TL;DR
The paper evaluates conditional image generative models as candidate world models by analyzing how they trade off realism, consistency, and diversity along Pareto fronts. It defines conditional and marginal metrics, catalogs the inference-time knobs that control the multi-objective tradeoffs, and applies the approach to text-to-image (T2I) and image-and-text-to-image (I-T2I) models on MSCOCO and GeoDE. The key findings are that realism and consistency can improve together but often at the cost of diversity, that older models offer greater diversity, and that disparities between geographical regions persist; knob choices such as guidance and post-filtering strongly shape outcomes. The work positions Pareto fronts as a practical analytical tool for selecting models for downstream world-model tasks and suggests directions for softening these tradeoffs in future research.
Abstract
Building world models that accurately and comprehensively represent the real world is the ultimate aspiration for conditional image generative models, as it would enable their use as world simulators. For these models to be successful world models, they should not only excel at image quality and prompt-image consistency but also ensure high representation diversity. However, current research in generative models mostly focuses on creative applications that are predominantly concerned with human preferences of image quality and aesthetics. We note that generative models have inference-time mechanisms, or knobs, that allow control over generation consistency, quality, and diversity. In this paper, we use state-of-the-art text-to-image and image-and-text-to-image models and their knobs to draw consistency-diversity-realism Pareto fronts that provide a holistic view of the consistency-diversity-realism multi-objective problem. Our experiments suggest that realism and consistency can be improved simultaneously; however, there is a clear tradeoff between realism/consistency and diversity. By looking at Pareto-optimal points, we note that earlier models are better at representation diversity and worse at consistency/realism, whereas more recent models excel at consistency/realism while significantly reducing representation diversity. By computing Pareto fronts on a geodiverse dataset, we find that the first version of latent diffusion models tends to perform better than more recent models on all axes of evaluation, and that there are pronounced consistency-diversity-realism disparities between geographical regions. Overall, our analysis clearly shows that there is no single best model and that the choice of model should be determined by the downstream application. With this analysis, we invite the research community to consider Pareto fronts as an analytical tool to measure progress towards world models.
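To make the notion of a Pareto front concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each model-and-knob configuration has already been scored on three axes (consistency, diversity, realism) and that every score is oriented so that higher is better. The function names `dominates` and `pareto_front`, and the scores themselves, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (consistency, diversity, realism), higher is better on every axis.
Point = Tuple[float, float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical scores, made up for illustration (not results from the paper).
configs = [
    (0.80, 0.30, 0.70),  # high consistency and realism, low diversity
    (0.60, 0.60, 0.50),  # balanced
    (0.75, 0.25, 0.65),  # worse than the first configuration on every axis
    (0.40, 0.80, 0.40),  # high diversity, low consistency and realism
]

front = pareto_front(configs)
print(front)
```

In this toy example the third configuration is dropped because the first one beats it on all three axes, while the remaining three survive because each is best on at least one axis. That mirrors the abstract's point that no single configuration wins everywhere. A metric for which lower is better would need to be negated before being passed in.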
