UI-UG: A Unified MLLM for UI Understanding and Generation

Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao

TL;DR

UI-UG addresses the need for accurate UI understanding and high-quality UI generation by unifying both capabilities in a single multimodal LLM. It uses a two-stage training pipeline: supervised fine-tuning on a UI-focused VQA dataset to strengthen understanding, followed by reinforcement learning, with GRPO for understanding tasks and Direct Preference Optimization (DPO) for generation tasks, to align outputs with human preferences. The paper also proposes a practical workflow, including an LLM-friendly UI DSL, progressive rendering, and task-specific evaluation metrics, that enables real-time, interactive UI construction. Experiments show state-of-the-art performance on modern UI understanding benchmarks and generation performance on par with larger models at a much lower computational cost, with clear gains from jointly training understanding and generation tasks.

Abstract

Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), which integrates both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding of modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) so that our model generates human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks. Code and Model: https://github.com/neovateai/UI-UG
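For readers unfamiliar with the two optimization objectives named in the abstract, the sketch below illustrates their standard textbook forms: the group-relative advantage used by GRPO and the per-pair DPO loss. It is a minimal illustration only, not the authors' implementation. The function names, the `beta` value, the use of population standard deviation, and the example rewards are all assumptions made here; the reward design and hyperparameters actually used for UI-UG are not stated in this section.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each sampled response's reward is
    standardized against the other responses drawn for the same prompt.
    (Population std is an assumption; implementations differ.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log(sigmoid(beta * (policy log-ratio margin vs. reference)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical rewards for four sampled answers to one grounding prompt.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# Hypothetical sequence log-probabilities for a preferred / rejected UI DSL pair.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

In this form, responses that score above their group's mean receive positive advantages, and the DPO loss falls below log 2 once the policy prefers the chosen output more strongly than the reference model does.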

Paper Structure

This paper contains 47 sections, 8 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: As UI resolution and design have advanced, modern apps include more icons, ads, and complex elements, creating new challenges for UI understanding.
  • Figure 2: The workflow for UI-UG includes 1) data preparation (UI image collection + element detection + DSL generation); 2) two-stage training: SFT with a VQA dataset, then RL optimization using GRPO and DPO for each task. The model supports UI understanding tasks (referring and grounding) and enables both offline and real-time UI generation.
  • Figure 3: Pipeline for UI DSL Dataset creation. The final data tuples consist of requirements, reference images, and corresponding UI DSLs.
  • Figure 4: Generation scores for different models, with each dimension normalized. Our model, UI-UG, gains quality through reinforcement learning and approaches the level of current, more powerful larger models.
  • Figure 5: Display of all our UI categories.
  • ...and 5 more figures