Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

Dimitrios Christodoulou; Mads Kuhlmann-Jørgensen

TL;DR

This work demonstrates that its annotation approach makes it feasible to comprehensively rank image generation models using a vast pool of annotators, and shows that the annotators' diverse demographics reflect the world population, significantly decreasing the risk of bias.

Abstract

Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.

Paper Structure (14 sections, 1 equation, 8 figures, 1 table)

Introduction
Methodology
Annotation Framework
Interface Design
Distribution
Quality Assurance
Prompts
Image Generation
Results
Data
Ranking Algorithm
Ranking
Demographic Statistics
Conclusion

Figures (8)

Figure 1: Example outputs from the four evaluated models based on the prompt: 'A yellow colored giraffe'.
Figure 2: Illustration of the interface used for annotation.
Figure 3: Ranking of the four models, Flux.1, DALL-E 3, MidJourney, and Stable Diffusion, based on the three criteria: style, coherence, and text-to-image alignment.
Figure 4: Example outputs from the four evaluated models based on the prompt: 'A pizza cooking an oven'.
Figure 5: Example outputs from Flux.1 and DALL-E 3 based on the prompt: 'Tcennis rpacket'.
...and 3 more figures
