PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Junxian Li; Kai Liu; Leyang Chen; Weida Wang; Zhixin Wang; Jiaqi Xu; Fan Li; Renjing Pei; Linghe Kong; Yulun Zhang

TL;DR

PlanViz addresses evaluating planning-oriented image generation and editing for computer-use tasks by introducing PlanScore, a task-adaptive metric with Cor, Vis, and Ef, where Cor = $|\mathcal{P}_s| / |\mathcal{P}|$, Vis = $S_v/5$, and Ef = $S_e/5$. It covers three sub-tasks (route planning, workflow diagramming, and web&UI displaying) and builds a high-quality, annotated data set with a data-construction pipeline and prompt-style diversification. Extensive experiments across 13 open-source or proprietary UMMs and 9 image-generation/editing models reveal significant gaps between open-source and closed-source systems and greater difficulty in planning-intensive editing, underscoring the need to integrate reasoning, planning, and visual generation for real-world computer-use assistance. PlanViz thus offers a comprehensive benchmarking framework to drive progress in planning-aware multimodal systems and informs future work on improving precision, layout understanding, and task-guided generation.
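
The three PlanScore components above can be sketched in a few lines of Python. This is only an illustration of the stated ratios, not the authors' implementation. It assumes that $\mathcal{P}$ is the set of key points checked for a task and $\mathcal{P}_s$ the subset the generated image satisfies, and that $S_v$ and $S_e$ are ratings with a maximum of 5; the summary gives the formulas but not these definitions. The names `plan_score_components` and `PlanScoreComponents` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanScoreComponents:
    cor: float  # Cor = |P_s| / |P|
    vis: float  # Vis = S_v / 5
    ef: float   # Ef  = S_e / 5


def plan_score_components(satisfied_points, all_points, s_v, s_e, max_rating=5.0):
    """Compute Cor, Vis and Ef from the ratios given in the TL;DR.

    Assumptions (not stated in the source): `all_points` is the set P of
    key points for the task, `satisfied_points` is the subset P_s that the
    image satisfies, and s_v / s_e are ratings in the range [0, max_rating].
    """
    all_set = set(all_points)
    if not all_set:
        raise ValueError("all_points must contain at least one key point")
    # Ignore any satisfied point that is not among the task's key points.
    satisfied = set(satisfied_points) & all_set
    for name, rating in (("s_v", s_v), ("s_e", s_e)):
        if not 0 <= rating <= max_rating:
            raise ValueError(f"{name} must lie in [0, {max_rating}]")
    return PlanScoreComponents(
        cor=len(satisfied) / len(all_set),
        vis=s_v / max_rating,
        ef=s_e / max_rating,
    )


if __name__ == "__main__":
    # Made-up example: 2 of 4 key points satisfied, ratings 4 and 5.
    print(plan_score_components({"start", "goal"},
                                {"start", "goal", "turn", "label"},
                                s_v=4, s_e=5))
```

All three components land in [0, 1] under these assumptions, which makes them comparable across sub-tasks; how the benchmark aggregates them, if at all, is not stated here, so the sketch leaves them separate.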

Abstract

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To this end, we focus on sub-tasks that arise frequently in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web&UI displaying. To ensure data quality, we curate human-annotated questions and reference images and apply a quality control process. For comprehensive and exact evaluation, we propose a task-adaptive score, PlanScore, which helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

Paper Structure (15 sections, 6 figures, 5 tables)

Introduction
Related Work
PlanViz
Motivation: Task-planning-based Evaluation
Data Construction
Score Judgement Pipeline
Human Evaluation
Experiment
Implementation
Main Results
Open-ended vs. Closed-ended
Distribution of Scores
Influence of Prompt Styles
Case Study
Conclusion

Figures (6)

Figure 1: Examples of generation (left) and editing (right). The queries are "generating a flowchart on how to apply a VISA" and "show what happens if setting 'Chinese(Simplified)' to the display language". Both UMMs make mistakes: Bagel does not provide a complete workflow and its text is meaningless; GPT-Image-1 produces garbled characters and fails to preserve the overall layout.
Figure 2: Overview of PlanViz. Our evaluation covers image generation and editing, with three proposed sub-tasks: route planning, work diagramming, and web&UI displaying. Compared with existing benchmarks, we introduce a new application domain for UMMs, computer-use tasks, and explore their planning capabilities with substantial human effort.
Figure 3: Pipeline of data construction. It consists of four stages: high-quality data collecting and cleaning, human annotation, quality check, and prompt style transformation. The stages are displayed from top-left to bottom-right.
Figure 4: The distribution (left) and the word cloud (right) of our benchmark. The green part represents route planning, while the orange part represents workflow diagramming, and the blue part represents web&UI displaying. The word cloud shows the hot topics in all questions of our benchmark.
Figure 5: Score distribution across different models. We show Cor (top) and Vis (bottom) for route planning.
...and 1 more figure
