GEBench: Benchmarking Image Generation Models as GUI Environments

Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang

TL;DR

GEBench introduces a GUI-focused benchmark with 700 interaction sequences across five task types to evaluate image-generation models as dynamic GUI environments. It pairs this benchmark with GE-Score, a five-dimensional metric (GOAL, LOGIC, CONS, UI, QUAL) aggregated over samples using VLM-guided evaluations to assess both functional correctness and visual fidelity. Findings show current models excel at single-step transitions but struggle with long-horizon planning and precise spatial grounding, with icon interpretation and text rendering identified as key bottlenecks. The framework, including a VLM-as-a-Judge evaluation pipeline, provides a foundation for developing high-fidelity, temporally coherent generative GUI systems for scalable GUI agent training.

Abstract

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general-domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
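
To make the idea of a five-dimensional score aggregated over samples concrete, here is a minimal Python sketch. It is an illustration only, not the GEBench implementation: the function name `aggregate_ge_scores`, the dictionary layout, the unweighted averaging, and the numeric values are all assumptions. The paper's actual aggregation formula and score scale may differ.

```python
from statistics import mean

# The five GE-Score dimensions named in the TL;DR and abstract.
DIMENSIONS = ("GOAL", "LOGIC", "CONS", "UI", "QUAL")


def aggregate_ge_scores(samples):
    """Average per-sample judge scores into per-dimension and overall values.

    ASSUMPTION: each sample is a dict mapping every dimension name to a
    numeric score (e.g. as returned by a VLM judge), and all dimensions and
    samples are weighted equally. The real benchmark may weight differently.
    """
    if not samples:
        raise ValueError("at least one scored sample is required")
    for index, sample in enumerate(samples):
        missing = set(DIMENSIONS) - set(sample)
        if missing:
            raise ValueError(f"sample {index} is missing {sorted(missing)}")

    # Mean of each dimension across all samples.
    result = {dim: mean(sample[dim] for sample in samples) for dim in DIMENSIONS}
    # Unweighted mean of the five per-dimension means (hypothetical choice).
    result["OVERALL"] = mean(result[dim] for dim in DIMENSIONS)
    return result


if __name__ == "__main__":
    # Hypothetical judge outputs for two generated GUI samples.
    scored = [
        {"GOAL": 4, "LOGIC": 2, "CONS": 3, "UI": 5, "QUAL": 1},
        {"GOAL": 2, "LOGIC": 4, "CONS": 3, "UI": 1, "QUAL": 5},
    ]
    print(aggregate_ge_scores(scored))
```

Keeping the per-dimension means alongside the overall value is useful here because the reported findings separate functional correctness (for example, GOAL on the grounding task) from visual fidelity.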

Paper Structure (27 sections, 1 equation, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Comparison of evaluation paradigms across different benchmark types. Existing image generation benchmarks prioritize general-domain visual fidelity, and video generation benchmarks evaluate continuous state transitions. GEBench uniquely evaluates discrete state transitions induced by user actions, capturing the essence of GUI interactions.
  • Figure 2: Examples of the five task types in GEBench, which are designed to comprehensively evaluate the capabilities of image generation models as GUI environments. GEBench provides image generation models with user instructions and a reference GUI state (no reference is provided for the Fictional App task) and evaluates the generated GUIs.
  • Figure 3: GEBench data construction pipeline. The process involves raw data capture through recording user interactions, task annotation of actions, quality control via preprocessing and verification, and data construction across five task categories: Single-Step, Multi-Step, Grounding, Real App, and Fictional App, totaling 700 samples.
  • Figure 4: Performance of models across GEBench task suites. The radar chart illustrates the performance of 12 prominent image generation models, including commercial models (solid lines) and open-source models (dashed lines). The reported results are the average scores on the Chinese and English subsets.
  • Figure 5: Comparison of GOAL score on grounding task. The universally low scores across all models highlight a critical deficiency in current generative models' ability to perceive and align with precise spatial grounding points.
  • ...and 7 more figures