UI-UG: A Unified MLLM for UI Understanding and Generation

Hao Yang; Weijie Qiu; Ru Zhang; Zhou Fang; Ruichao Mao; Xiaoyu Lin; Maji Huang; Zhaosong Huang; Teng Guo; Shuoyang Liu; Hai Rao

TL;DR

UI-UG addresses the need for accurate UI understanding and high-quality UI generation by unifying both capabilities in a single multimodal LLM. It uses a two-stage training pipeline: supervised fine-tuning on a UI-focused VQA dataset to strengthen understanding, followed by Group Relative Policy Optimization (GRPO) for understanding tasks and Direct Preference Optimization (DPO) for generation tasks, the latter aligning generated UIs with human preferences. The paper also proposes a practical workflow, including an LLM-friendly UI DSL, progressive rendering, and task-specific evaluation metrics, that enables real-time, interactive UI construction. Experiments show state-of-the-art performance on modern UI understanding benchmarks and generation performance on par with larger models at a much lower computational cost, with clear gains from jointly training understanding and generation tasks.

Abstract

Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), which integrates both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding of modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) so that our model generates human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks improves accuracy and quality for both. Code and Model: https://github.com/neovateai/UI-UG
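
For readers unfamiliar with the two optimization methods named above, the following is a minimal sketch of their core quantities in their commonly published form, not the UI-UG training code. The function names, the use of the population standard deviation, the `eps` constant, and the default `beta` are all illustrative assumptions; the reward functions and hyperparameters used in the paper are not given in this section.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: score each sampled response relative to
    the other responses drawn for the same prompt.

    Assumption: population standard deviation and an `eps` guard
    against division by zero; implementations differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def _softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the summed log-probability of a full response
    under the trained policy or the frozen reference model.
    `beta=0.1` is an illustrative default, not a value from the paper.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) equals softplus(-x).
    return _softplus(-beta * margin)


if __name__ == "__main__":
    # Four hypothetical responses to one prompt, scored 1 (correct) or 0.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # The policy prefers the chosen response more than the reference does.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In this sketch, responses that beat their group's average reward receive positive advantages, and the DPO loss shrinks as the policy favors the human-preferred output more strongly than the reference model does.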
