
Computer User Interface Understanding. A New Dataset and a Learning Framework

Andrés Muñoz, Daniel Borrajo

TL;DR

This work tackles computer UI understanding by treating the screen as a state descriptor and introducing the DataVisualWorkflow dataset, which captures complex desktop workflows with diverse software and contextual cues. The authors propose UIMTCon, a semi-supervised learning framework that combines a synthetic data generator with multi-task contrastive learning across a software-view-context label hierarchy, using three projection heads and a Split Hierarchy Loss. Experimental results on DataVisualWorkflow show that UIMTCon, particularly at three hierarchy levels, improves retrieval and clustering performance versus baselines and is robust to label noise introduced by synthetic data, while enabling strong in-distribution and competitive out-of-distribution generalization. The dataset and framework jointly advance unsupervised and semi-supervised UI representation learning and hold promise for enterprise workflow automation and cross-application automation tasks.

Abstract

User Interface (UI) understanding has become an increasingly popular topic over the last few years. So far, research has focused almost exclusively on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. To enable research in this field, we have generated a dataset of videos in which a user performs a sequence of actions, and each image shows the desktop contents at that time point. We also present a framework composed of a synthetic sample generation pipeline, which augments the dataset with relevant characteristics, and a contrastive learning method to classify the images in the videos. We take advantage of the natural conditional, tree-like relationship among the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grained UI classification.
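The idea of handling several partial tasks over a tree-like label can be illustrated with a small sketch. It assumes each image carries a (software, view, context) label, derives one nested label per hierarchy level, and sums a generic supervised contrastive loss over the levels. This is not the paper's Split Hierarchy Loss; the function names, the temperature value, and the per-level embeddings (standing in for the outputs of the three projection heads) are all hypothetical choices made for illustration.

```python
import math


def prefix_labels(software, view, context):
    """Return the nested labels for the three partial tasks (assumed label layout)."""
    return (software,), (software, view), (software, view, context)


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over L2-normalised vectors.

    A stand-in for illustration only, not the loss proposed in the paper.
    """
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled cosine similarity between the anchor and every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def multi_level_loss(head_outputs, samples, temperature=0.1):
    """Sum one contrastive term per hierarchy level.

    head_outputs[k] holds the embeddings for level k (hypothetical head outputs);
    samples is a list of (software, view, context) tuples.
    """
    per_level_labels = list(zip(*(prefix_labels(*s) for s in samples)))
    return sum(
        supcon_loss(emb, labs, temperature)
        for emb, labs in zip(head_outputs, per_level_labels)
    )
```

In this sketch, two images of the same software but different views are positives for the first level and negatives for the second and third, so the coarser tasks constrain how the finer ones are separated.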

Paper Structure (24 sections, 2 equations, 11 figures, 7 tables, 1 algorithm)

Figures (11)

  • Figure 1: Sample images taken from DataVisualWorkflow.
  • Figure 2: Samples of sequences taken from DataVisualWorkflow.
  • Figure 3: Distribution of long-form actions and random actions in DataVisualWorkflow.
  • Figure 4: (a) Illustration of the full architecture of the system. (b) The standard architecture of a contrastive learning task, $\text{FC}_{\text{svc}}$. (c) Illustration of the proposed method, where $\text{FC}_{\text{s}}$, $\text{FC}_{\text{sv}}$ and $\text{FC}_{\text{svc}}$ are the projection heads for the software, software-view and software-view-context tasks, respectively. The red line marks the path used for inference; all other paths are used for training only.
  • Figure 5: Synthetic Generator samples. (a) Shows a few samples from the synthetic menu generator. (b) Presents samples generated by the selected text generator.
  • ...and 6 more figures