GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
TL;DR
GenArena addresses the unreliability of absolute pointwise evaluation in visual generation by introducing a pairwise, VLM-based judging protocol whose outcomes are aggregated with the Elo rating system. The framework builds a large, diverse benchmark (6,086 prompts) and enforces a robust judging process with bi-directional checks and forced-choice verdicts. The resulting rankings reach a Spearman correlation of 0.86 with the human-preference-based LMArena leaderboard, far above the 0.36 achieved by pointwise baselines. Crucially, open-source VLMs used in this pairwise setting achieve state-of-the-art judging accuracy without fine-tuning, at times outperforming proprietary systems. GenArena thus offers a scalable, reproducible, and accessible standard for benchmarking next-generation visual generation models, with implications for reliability, interpretability, and faster progress, while acknowledging potential biases inherited from VLM training data.
Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments show that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
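To make the pairwise protocol summarized above concrete, the following is a minimal sketch, not the authors' implementation. It combines a forced-choice judge, a bi-directional check (the same pair is judged in both presentation orders), and a standard Elo update. Every name in it (`Judge`, `bidirectional_verdict`, `rate_models`, the toy judges) is hypothetical, as are the K-factor of 32, the initial rating of 1000, and the choice to discard comparisons whose two orderings disagree; the summary here does not specify how such cases are handled.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical judge interface: given a prompt and two outputs in presentation
# order, return "first" or "second" (forced choice, no ties allowed).
Judge = Callable[[str, str, str], str]


def bidirectional_verdict(judge: Judge, prompt: str, out_a: str, out_b: str) -> Optional[str]:
    """Judge the pair in both orders; return "a" or "b" only if both orders agree."""
    forward = judge(prompt, out_a, out_b)   # A is shown first
    backward = judge(prompt, out_b, out_a)  # B is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one Elo update in place (k is an assumed K-factor)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


Battle = Tuple[str, Tuple[str, str], Tuple[str, str]]  # (prompt, (model, output), (model, output))


def rate_models(battles: List[Battle], judge: Judge,
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Turn a sequence of pairwise battles into Elo ratings."""
    ratings: Dict[str, float] = {}
    for prompt, (model_a, out_a), (model_b, out_b) in battles:
        ratings.setdefault(model_a, initial)
        ratings.setdefault(model_b, initial)
        verdict = bidirectional_verdict(judge, prompt, out_a, out_b)
        if verdict is None:
            continue  # Assumption: order-inconsistent verdicts are dropped.
        if verdict == "a":
            elo_update(ratings, model_a, model_b, k)
        else:
            elo_update(ratings, model_b, model_a, k)
    return ratings


# Toy stand-ins for a VLM judge, for illustration only.
def longer_judge(prompt: str, first: str, second: str) -> str:
    """Prefers the longer output regardless of position."""
    return "first" if len(first) >= len(second) else "second"


def always_first_judge(prompt: str, first: str, second: str) -> str:
    """Purely position-biased judge: always picks whatever is shown first."""
    return "first"
```

In this sketch, a purely position-biased judge such as `always_first_judge` never changes any rating, because its two orderings always contradict each other. That is the kind of inconsistency a bi-directional check is meant to filter out.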
