Computer-Use Agents as Judges for Generative User Interface

Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

TL;DR

The paper addresses the misalignment between human-centric GUI design and autonomous Computer-Use Agents by reframing the UI as an agent-native environment and introducing a Coder–CUA collaborative loop. It presents AUI-Gym, a benchmark of 52 apps with 1,560 GPT-5–generated tasks and per-task verifiers, enabling automated, functional testing of UI designs. The authors demonstrate that integrating task-solvability feedback with navigation feedback through a CUA Dashboard substantially improves both task success and robustness across diverse domains, with notable gains for weaker coders. This agent-centric approach offers scalable, interpretable guidance for automatic UI design and testing, potentially reducing reliance on human-centric aesthetics in favor of agent efficiency and reliability.

Abstract

Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet most GUIs remain designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: can CUA act as judges to assist Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1,560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
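
The Designer/Judge loop described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the function `collaborate`, the `RoundResult` record, the three callables (`design`, `is_solvable`, `navigate`), and the plain-text feedback string are all hypothetical names introduced here for illustration. The sketch also assumes that the CUA only attempts tasks the verifier has marked solvable, which the abstract does not state. It reports the two quantities the abstract names as success measures: task solvability and CUA navigation success rate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RoundResult:
    round_index: int
    solvable_rate: float     # fraction of tasks the verifier deems solvable
    cua_success_rate: float  # fraction of all tasks the CUA completes


def collaborate(
    query: str,
    tasks: List[str],
    design: Callable[[str, str], str],        # (query, feedback) -> UI source
    is_solvable: Callable[[str, str], bool],  # (ui, task) -> verifier verdict
    navigate: Callable[[str, str], bool],     # (ui, task) -> CUA succeeded?
    rounds: int = 2,
) -> Tuple[str, List[RoundResult]]:
    """Hypothetical Designer/Judge loop (illustrative only)."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    ui, feedback = "", ""
    history: List[RoundResult] = []
    for r in range(rounds):
        # Coder as Designer: create or revise the UI from the latest feedback.
        ui = design(query, feedback)
        # Verifier: which tasks cannot be completed on this UI at all?
        unsolvable = [t for t in tasks if not is_solvable(ui, t)]
        # CUA as Judge (assumption: it only attempts solvable tasks).
        failed = [
            t for t in tasks
            if t not in unsolvable and not navigate(ui, t)
        ]
        solved = len(tasks) - len(unsolvable) - len(failed)
        history.append(
            RoundResult(
                round_index=r,
                solvable_rate=1 - len(unsolvable) / len(tasks),
                cua_success_rate=solved / len(tasks),
            )
        )
        if not unsolvable and not failed:
            break  # nothing left to fix
        # Stand-in for the richer feedback a dashboard would summarize.
        feedback = (
            f"unsolvable tasks: {unsolvable}; navigation failures: {failed}"
        )
    return ui, history
```

Passing the three roles in as callables keeps the sketch independent of any particular model or browser driver; a real system would replace them with calls to a coding model, a verifier, and an agent runtime.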

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of human collaboration vs. our Coder-CUA Collaboration in terms of UI design. Left: Most GUIs are designed by humans and optimized for user experience (e.g., aesthetics), forcing trained agents to adapt to human-oriented behaviors. Right: Our Coder-CUA Collaboration framework leverages Coder as Designer and CUA as Judge together, enabling more reliable task execution and improved usability for agents.
  • Figure 2: AUI-Gym task definition. A user issues a request (e.g., “Create a Data Visualization Playground”), and agents (e.g., Coder or CUA) interact with the GUI through design, exploration, and feedback. In this setup, the GUI serves as a tunable environment.
  • Figure 3: AUI-Gym construction pipeline. (i) An input query specifies the app requirements. (ii) GPT-5 proposes candidate tasks with explicit goals. (iii) Humans filter and refine tasks using domain-specific principles. (iv) A test-time Verifier reads the website HTML and generates task-specific, rule-based checkers to validate success on the to-be-tested website.
  • Figure 4: Overview of the Coder-CUA in Collaboration framework. The process begins with the Coder as Designer, which initializes and iteratively revises the UI based on queries and feedback. In parallel, the CUA as Judge executes task-driven navigation within the testing environment, generating trajectories and error logs to evaluate task solvability. A verifier ensures functional correctness, while feedback from CUA navigation informs subsequent UI revisions. This collaboration yields a finalized agent-centric UI optimized for both functionality and execution success.
  • Figure 5: Ablation studies of the CUA Dashboard and iterative rounds. Left (a-b): Effect of the CUA Dashboard. Right (c-d): Performance across different iterative revision rounds.
  • ...and 2 more figures
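
As a rough illustration of what a task-specific, rule-based checker (Figure 3, step iv) might look like, the Python sketch below builds a checker that passes when the element with a given `id` contains an expected piece of text. The rule shape, the helper names `make_checker` and `_TextById`, and the example ids are assumptions made here for illustration; the source only states that the Verifier reads the website HTML and generates rule-based checkers. The sketch uses only the standard-library `html.parser` module and assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

# Elements that never have a closing tag, so they must not be pushed
# onto the open-element stack.
_VOID = {"area", "base", "br", "col", "embed", "hr", "img",
         "input", "link", "meta", "source", "track", "wbr"}


class _TextById(HTMLParser):
    """Collects the text content of every element that carries an id."""

    def __init__(self) -> None:
        super().__init__()
        self.text: Dict[str, str] = {}
        self._open: List[Optional[str]] = []  # ids of currently open elements

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id is not None:
            self.text.setdefault(element_id, "")
        if tag not in _VOID:
            self._open.append(element_id)

    def handle_endtag(self, tag):
        if tag not in _VOID and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element that has an id.
        for element_id in self._open:
            if element_id is not None:
                self.text[element_id] += data


def make_checker(element_id: str, expected: str) -> Callable[[str], bool]:
    """Return a rule: the element `element_id` must contain `expected`."""
    def check(html: str) -> bool:
        parser = _TextById()
        parser.feed(html)
        parser.close()
        return expected in parser.text.get(element_id, "")
    return check


# Usage with a made-up page snapshot and a made-up rule.
page = '<div id="status"><span>Chart</span> rendered</div><input id="q">'
chart_rendered = make_checker("status", "Chart rendered")
print(chart_rendered(page))
```

A checker of this kind inspects only the final page state, so it can confirm that a task's goal was reached without depending on how the agent navigated to it.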