
RWKV-UI: UI Understanding with Enhanced Perception and Reasoning

Jiaxi Yang, Haowen Hou

TL;DR

RWKV-UI addresses the challenge of high-resolution UI understanding by integrating a three-encoder visual architecture with a lossless high-resolution partition strategy and visual prompts. It augments this with a CoT-based reasoning framework and layout detection to model webpage structure and user interactions, achieving strong performance on VisualWebBench with only 1.6B parameters. The combination of visual prompts, CoT reasoning, and domain-aware pretraining yields significant gains in element grounding, OCR, and action prediction, highlighting the approach's effectiveness for precise UI comprehension and interactive tasks. This work presents a practical, scalable framework for high-resolution UI analysis with potential applications in UI automation and multimodal web understanding.

Abstract

Existing Visual Language Models often struggle with information loss and limited reasoning abilities when handling high-resolution web interfaces that combine complex visual, textual, and interactive elements. These challenges are particularly evident in tasks requiring webpage layout comprehension and multi-step interactive reasoning. To address them, we propose RWKV-UI, a Visual Language Model based on the RWKV architecture and specifically designed to handle high-resolution UI images. During model training, we introduce layout detection as a visual prompt to help the model better understand webpage layout structures. Additionally, we design a visual prompt based on the Chain-of-Thought (CoT) mechanism, which enhances the model's ability to understand and reason about webpage content through reasoning chains. Experimental results show that RWKV-UI achieves significant performance improvements in high-resolution UI understanding and interactive reasoning tasks.

Paper Structure

This paper contains 29 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall architecture of our model. The model input includes the complete webpage image with visual prompts, the split images of its four sections, and the textual input.
  • Figure 2: Data samples based on visual prompt engineering include (a) webpage annotation data, (b) layout detection-based QA data, and (c) Chain-of-Thought reasoning-based data.
  • Figure 3: Prompts for calling the GPT API.
  • Figure 4: An example of CoT data generated by calling the ChatGPT 4.0 API.
  • Figure 5: The results on seven metrics.
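
The Figure 1 caption mentions that the model receives the full webpage image together with split images of its four sections. A minimal sketch of such a split is given below, assuming a 2 x 2 grid of tiles cut at the image midpoints; the grid layout, the function name split_into_quadrants, and the toy input are illustrative assumptions and are not taken from the paper. The round trip at the end shows that cropping, unlike downscaling, discards no pixels.

```python
import numpy as np


def split_into_quadrants(image: np.ndarray) -> list[np.ndarray]:
    """Split an H x W x C image into four tiles without resizing.

    Hypothetical helper: the 2 x 2 layout is an assumption, since the
    source only states that the page is split into four sections.
    """
    height, width = image.shape[:2]
    mid_h, mid_w = height // 2, width // 2
    return [
        image[:mid_h, :mid_w],  # top-left
        image[:mid_h, mid_w:],  # top-right
        image[mid_h:, :mid_w],  # bottom-left
        image[mid_h:, mid_w:],  # bottom-right
    ]


# Toy stand-in for a webpage screenshot (4 rows, 6 columns, 3 channels).
page = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)
tiles = split_into_quadrants(page)

# Stitch the tiles back together to check that no pixel was lost.
top = np.concatenate(tiles[:2], axis=1)
bottom = np.concatenate(tiles[2:], axis=1)
restored = np.concatenate([top, bottom], axis=0)
```

In a pipeline of this kind, each tile and the full image would then be passed to the visual encoders separately; how RWKV-UI combines them is described in the paper itself, not in this sketch.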