GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

Ruihang Li; Leigang Qu; Jingxu Zhang; Dongnan Gui; Mengde Xu; Xiaosong Zhang; Han Hu; Wenjie Wang; Jiaqi Wang

TL;DR

GenArena confronts the unreliability of absolute pointwise evaluation in visual generation by introducing a pairwise, VLM-based judging protocol whose outcomes are aggregated with the Elo rating system. The framework builds a large, diverse benchmark (6,086 prompts) and enforces a robust judging process with bi-directional checks and forced choice. The resulting ranking reaches a Spearman correlation of $0.86$ with human preferences, far surpassing pointwise baselines ($0.36$). Crucially, open-source VLMs used in this pairwise setting achieve state-of-the-art accuracy, at times outperforming proprietary systems without fine-tuning. GenArena thus provides a scalable, reproducible, and democratized standard for benchmarking next-generation visual generation models, with implications for reliability, interpretability, and faster progress, while acknowledging potential biases inherited from VLM training data.

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
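To make the pairwise-plus-Elo idea concrete, here is a minimal sketch of how forced-choice verdicts checked in both presentation orders could feed a standard Elo update. It is an illustration under stated assumptions, not the paper's implementation: the function names, the starting rating of 1000, the K-factor of 32, the choice to score order-inconsistent verdicts as a tie, and the toy `quality_judge` standing in for a VLM are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict

# A judge takes (first_item, second_item) and must answer "first" or "second"
# (forced choice: no tie option). In practice this would wrap a VLM call.
Judge = Callable[[str, str], str]


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def bidirectional_outcome(judge: Judge, a: str, b: str) -> float:
    """Ask the judge in both orders and return A's score.

    Assumption: if the two verdicts disagree (a sign of position bias),
    the battle is scored as a tie (0.5).
    """
    a_wins_forward = judge(a, b) == "first"
    a_wins_reversed = judge(b, a) == "second"
    if a_wins_forward and a_wins_reversed:
        return 1.0
    if not a_wins_forward and not a_wins_reversed:
        return 0.0
    return 0.5


def update_elo(ratings: Dict[str, float], a: str, b: str,
               score_a: float, k: float = 32.0) -> None:
    """Apply one zero-sum Elo update in place."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] -= k * (score_a - e_a)


# Toy stand-in for a VLM judge: prefers the item with higher hidden quality.
quality = {"A": 0.9, "B": 0.5, "C": 0.1}


def quality_judge(first: str, second: str) -> str:
    return "first" if quality[first] > quality[second] else "second"


def position_biased_judge(first: str, second: str) -> str:
    return "first"  # always prefers whatever is shown first


ratings = {name: 1000.0 for name in quality}
for _ in range(5):  # a few rounds of round-robin battles
    for a, b in combinations(sorted(quality), 2):
        update_elo(ratings, a, b, bidirectional_outcome(quality_judge, a, b))

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
biased_score = bidirectional_outcome(position_biased_judge, "A", "B")
```

In this toy run the leaderboard recovers the hidden quality order, and a judge that always prefers the first-shown item contributes only ties, which is the kind of bias the bi-directional check is meant to neutralize.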

Paper Structure (31 sections, 5 equations, 7 figures, 6 tables)

Introduction
Related Work
LLM-as-a-Judge.
Visual Generation Benchmarks.
Elo Rating System in AI Evaluation.
Revisiting the VLM-as-a-judge Paradigm
Pairwise scoring is more accurate than its pointwise alternatives
Pairwise scoring is more consistent than its pointwise alternatives
GenArena
Overview
Benchmark Composition
Automatic Pairwise Scoring
Elo Ranking Aggregation
Evaluation and Benchmarking
Elo ranking aligns more with human preferences
...and 16 more sections

Figures (7)

Figure 1: Schematic illustration of the comparison between pointwise and pairwise scoring, the GenArena Benchmark, and the Elo Rating System. (a) Current benchmarks rely on absolute pointwise scoring, which suffers from self-consistency collapse. As shown, stochastic fluctuations in VLM outputs result in volatile rankings (e.g., $A>B$ in Try 1, but $B>A$ in Try 2) for the same input. In contrast, pairwise comparison yields consistent and robust preferences. (b) GenArena builds upon this stable pairwise paradigm. We curate a diverse set of prompts (including multi-reference generation tasks) and conduct large-scale peer battles using VLMs as judges. These pairwise outcomes are aggregated via the Elo Rating System to produce an accurate and reproducible model leaderboard.
Figure A.1: Qualitative comparison of visual generation tasks and generated results from various models in GenArena. The benchmark assesses models on three distinct dimensions: Basic, Reasoning (e.g., predicting environmental effects over time), and MultiRef (composing scenes from multiple image conditions). The bottom row displays sample outputs from leading proprietary and open-source models (e.g., GPT-Image-1.5, FLUX.2, Qwen-Image) on a complex multi-reference composition task, highlighting the variance in adherence to spatial and material constraints.
Figure A.2: Qualitative comparison of multi-reference generation tasks and generated results from various models in GenArena.
Figure A.3: Distribution of Score Differences in Pointwise Evaluation. We visualize the score difference ($\Delta S = S_{\text{better}} - S_{\text{worse}}$) assigned by Qwen3-VL 8B Instruct on EditScore-Bench [luo2025editscore] under the pointwise paradigm. The green region ($\Delta S > 0$) denotes correct alignment with human preference, while the red region ($\Delta S \leq 0$) indicates contradictory rankings. Notably, the distribution reveals a significant limitation in discriminative power: only 58.3% of cases are correctly ranked. A substantial portion of comparisons result in ties (23.5%, spike at $\Delta S = 0$) or direct errors (18.2%), confirming that absolute pointwise scoring struggles to resolve fine-grained visual differences.
Figure A.4: Qualitative comparison of pointwise judgment and pairwise judgment.
...and 2 more figures
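
The correct/tie/error breakdown described in the Figure A.3 caption follows directly from the sign of $\Delta S$. The sketch below shows one way such a tally could be computed from pointwise scores; the function name and the toy score pairs are hypothetical and are not data from the paper.

```python
from typing import Dict, Iterable, Tuple


def pointwise_breakdown(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Classify each (S_better, S_worse) pair by the sign of Delta S.

    'better' is the human-preferred output, so Delta S > 0 is correct,
    Delta S == 0 is a tie, and Delta S < 0 is an error.
    """
    counts = {"correct": 0, "tie": 0, "error": 0}
    total = 0
    for s_better, s_worse in pairs:
        delta = s_better - s_worse
        if delta > 0:
            counts["correct"] += 1
        elif delta == 0:
            counts["tie"] += 1
        else:
            counts["error"] += 1
        total += 1
    if total == 0:
        raise ValueError("no score pairs given")
    return {label: n / total for label, n in counts.items()}


# Made-up pointwise scores for four human-labelled pairs.
toy_pairs = [(8, 6), (7, 7), (5, 6), (9, 4)]
breakdown = pointwise_breakdown(toy_pairs)
```

A forced-choice pairwise judge has no tie bucket by construction, which is one reason the pointwise tie spike at $\Delta S = 0$ matters for ranking.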
