SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation

Dehai Zhao; Zhenchang Xing; Qinghua Lu; Xiwei Xu; Liming Zhu

TL;DR

SeeAction tackles non-intrusive reverse engineering of structured HCI actions from screencasts by introducing a three-stream, multi-task deep model that jointly predicts the command, the widget, and a location phrase. It defines 11 command classes and 11 widget classes with a free-form location vocabulary, and uses three input streams (original frames, cropped change regions, and similarity maps) to extract robust spatio-temporal features. Trained on 7,260 labeled fragments from five desktop apps, SeeAction achieves high command and widget accuracy and solid location-language quality, and it generalizes across applications. A pilot bug-reproduction study shows that a screencast-to-action-script tool built on SeeAction improves bug reproduction rates compared with relying on traditional steps-to-reproduce text descriptions. These results indicate strong potential for vision-based UI automation, testing, and non-intrusive bug reproduction workflows.
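To make the three input streams concrete, the following is a minimal sketch of how they could be derived from two consecutive grayscale frames. It is an illustration under stated assumptions, not the paper's pipeline: the function names, the per-pixel absolute-difference similarity, and the 0.9 threshold are all hypothetical choices made here, and the actual model may compute its change regions and similarity maps differently.

```python
from typing import Dict, List, Optional, Tuple

Frame = List[List[int]]          # grayscale frame: rows of 0-255 intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def similarity_map(before: Frame, after: Frame) -> List[List[float]]:
    """Per-pixel similarity in [0, 1]; 1.0 means the pixel did not change.

    Assumption: a simple absolute-difference measure stands in for whatever
    similarity metric the real system uses.
    """
    return [
        [1.0 - abs(a - b) / 255.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(before, after)
    ]


def change_box(sim: List[List[float]], threshold: float = 0.9) -> Optional[Box]:
    """Bounding box of pixels whose similarity is below `threshold`.

    Returns None when no pixel changed. The threshold is a hypothetical value.
    """
    rows = [r for r, row in enumerate(sim) if any(v < threshold for v in row)]
    cols = [c for row in sim for c, v in enumerate(row) if v < threshold]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the inclusive bounding box out of a frame."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


def three_streams(before: Frame, after: Frame) -> Dict[str, object]:
    """Assemble the three inputs: original frames, change region, similarity map."""
    sim = similarity_map(before, after)
    box = change_box(sim)
    # If nothing changed, fall back to the whole frame as the "change region".
    region = crop(after, box) if box is not None else after
    return {"frames": (before, after), "change_region": region, "similarity": sim}
```

In this sketch the cropped change region focuses on the part of the screen that the action affected, while the similarity map keeps the spatial layout of where the change happened, which is the kind of signal a location phrase would need.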

Abstract

UI automation is a useful technique for UI testing, bug reproduction, and robotic process automation. Recording user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support, or assume specific app implementations. Reverse engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing: recognizing human-understandable structured user actions ([command] [widget] [location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model that can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7,260 video-action pairs, which record user interactions with Word, Zoom, Firefox, Photoshop, and Windows 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.
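The structured action format ([command] [widget] [location]) can be illustrated with a small data type. This is a hedged sketch only: the class name `HciAction`, its methods, and the example values are hypothetical and introduced here for illustration. The 11 command classes and 11 widget classes are not listed in this section, so the strings below are placeholders rather than the paper's label set.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HciAction:
    """One recognized user action: how (command), what (widget), where (location)."""

    command: str   # predicted from a fixed set of command classes
    widget: str    # predicted from a fixed set of widget classes
    location: str  # free-form generated phrase

    def to_script_line(self) -> str:
        """Render the action in the bracketed form used in the abstract."""
        return f"[{self.command}] [{self.widget}] [{self.location}]"


def parse_script_line(line: str) -> HciAction:
    """Parse a bracketed line back into an HciAction.

    Raises ValueError unless the line has exactly three bracketed fields.
    """
    parts = re.findall(r"\[([^\]]*)\]", line)
    if len(parts) != 3:
        raise ValueError(f"expected 3 bracketed fields, got {len(parts)}")
    return HciAction(*parts)


# Placeholder values, not taken from the paper's label set.
example = HciAction("click", "button", "at the top left of the window")
print(example.to_script_line())
```

Keeping the command and widget as categorical fields and the location as free text mirrors the split described in the abstract between classification (commands and widgets) and phrase generation (locations).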
