
GUI Action Narrator: Where and When Did That Action Take Place?

Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

TL;DR

This work introduces Act2Cap, a GUI video-caption benchmark of 4,189 samples capturing atomic GUI actions across diverse applications, and GUI Narrator, a cursor-grounded two-stage framework for narrating GUI events. By combining action-aware spatial and temporal sampling with a captioning model, the approach improves over baseline multimodal models, including GPT-4o, and demonstrates gains when used for fine-tuning open models or as prompting signals in closed models. The study provides a rigorous evaluation methodology using an LLM-based semantic IoU for action-type grounding and reveals the persistent difficulty of GUI action understanding for current models. The contributions offer a practical pathway to enhance GUI automation systems by enabling learning from user demonstrations and improving task-level grounding in dense, high-resolution interfaces.

Abstract

The advent of multimodal LLMs has significantly improved optical character recognition (OCR) on images, making GUI automation a viable way to increase efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This understanding is crucial because it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset, Act2Cap, as well as a simple yet effective framework, GUI Narrator, for GUI video captioning that uses the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that the task remains highly challenging even for today's most advanced multimodal models, such as GPT-4o. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.

Paper Structure

This paper contains 21 sections, 1 equation, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: Illustration of GUI Action Narration, comparing the action narrations generated by closed-source models with our result. Green indicates correct content, while red indicates errors.
  • Figure 2: Data Collection Pipeline: (left) Our automatic data collection pipeline and (right) manual data collection pipeline.
  • Figure 3: Action distribution in the training and test datasets. The left-hand side shows the distribution of the training data and the right-hand side shows that of the test data.
  • Figure 4: Overview of GUI Narrator: It first processes sampled frames from the video through a spatial detection model, which locates the cursor, adds visual prompts to the screenshot, and crops the region near the cursor to represent each frame. Subsequently, the temporal detection model identifies further keyframes based on the cropped sub-images. The extracted keyframes, combined with a text query, are then fed into the VLM, which generates a narration describing the GUI actions.
  • Figure 5: Temporal Detection Model: We use a frozen ViT encoder from a pre-trained OpenCLIP model together with trainable multi-head self-attention layers.
  • ...and 7 more figures
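
The spatial step described in the Figure 4 caption (crop the region near the detected cursor) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cursor_crop_box` and `crop_frame`, the fixed crop size, and the choice to shift the window back inside the frame near screen edges are all assumptions made for illustration, and the cursor position is taken as given rather than produced by a trained detector.

```python
from typing import List, Sequence, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def cursor_crop_box(
    cursor_xy: Tuple[int, int],
    frame_size: Tuple[int, int],
    crop_size: Tuple[int, int],
) -> Box:
    """Return a crop window centered on the cursor, kept inside the frame.

    Hypothetical helper: the crop size and edge handling are assumptions,
    not values taken from the paper.
    """
    cx, cy = cursor_xy
    width, height = frame_size
    # Never ask for a window larger than the frame itself.
    crop_w = min(crop_size[0], width)
    crop_h = min(crop_size[1], height)
    # Center on the cursor, then clamp so the window stays within bounds.
    left = min(max(cx - crop_w // 2, 0), width - crop_w)
    top = min(max(cy - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h


def crop_frame(frame: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Crop a frame stored as rows of pixel values (row-major nested lists)."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in frame[top:bottom]]


if __name__ == "__main__":
    # Cursor near the top-left corner of a 1920x1080 screenshot.
    print(cursor_crop_box((10, 10), (1920, 1080), (512, 512)))
    # Cursor near the bottom-right corner.
    print(cursor_crop_box((1900, 1070), (1920, 1080), (512, 512)))
```

Keeping the window at a fixed size even at screen edges means every cropped sub-image has the same shape, which is convenient if the crops are later batched for a keyframe classifier; whether the original system does this is not stated in the text above.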