GUI Element Detection Using SOTA YOLO Deep Learning Models

Seyed Shayan Daneshvar; Shaowei Wang

TL;DR

This study benchmarks four recent YOLO models (YOLOv5R7s, YOLOv6R3s, YOLOv7, YOLOv8s) on the VINS GUI element dataset to evaluate GUI element detection performance. The models start from COCO-pretrained weights and are trained for 300 epochs on 416×416 images; accuracy is assessed via $mAP@0.5$ and $AP[.5:.05:.95]$. YOLOv5R7s scores highest at $AP@0.5$, while YOLOv7 delivers the best $AP[.5:.05:.95]$. Element-wise, Drawers and Switches are the easiest to detect and Checked Text Views the hardest, and YOLOv8s is sensitive to aspect ratios. The study shows that model rankings on GUI data can diverge from rankings on general datasets such as COCO, and it highlights the risk of relying solely on validation performance for model selection. It also discusses limitations and threats to validity, and suggests future work on integrating GUI element detection with code generation and on exploring differences between mobile and web GUIs. Overall, the work provides a GUI-specific view of state-of-the-art YOLO performance and practical guidance for GUI-focused detection tasks.
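To make the two metrics above concrete, the following minimal sketch shows the IoU test that underlies them: $AP@0.5$ counts a prediction as a match when its IoU with a ground-truth box is at least 0.5, while $AP[.5:.05:.95]$ averages over ten IoU thresholds from 0.50 to 0.95. The function names and the example boxes are illustrative assumptions, not taken from the paper, and the sketch deliberately stops short of the precision-recall integration that a full AP computation requires.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95 used by AP[.5:.05:.95]."""
    return [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """IoU thresholds at which pred_box would count as matching gt_box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in coco_iou_thresholds() if overlap >= t]


# Hypothetical wide, flat GUI element and a prediction shifted 10 px right.
gt = (0, 0, 100, 40)
pred = (10, 0, 110, 40)
print(round(iou(pred, gt), 3))       # 0.818
print(thresholds_passed(pred, gt))   # passes 0.5 through 0.8, fails 0.85+
```

In this example a 10-pixel horizontal shift still counts as a correct detection at IoU 0.5 but fails the three strictest thresholds, which is one way a model can rank well on $AP@0.5$ and less well on $AP[.5:.05:.95]$.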

Abstract

Detection of Graphical User Interface (GUI) elements is a crucial task for automatic code generation from images and sketches, GUI testing, and GUI search. Recent studies have leveraged both old-fashioned and modern computer vision (CV) techniques. Old-fashioned methods use classic image processing algorithms (e.g., edge detection and contour detection), while modern methods use mature deep learning solutions developed for general object detection tasks. GUI element detection, however, is a domain-specific case of object detection: objects overlap more often and are located very close to each other, the number of object classes is considerably lower, and yet each image contains more objects than a typical natural image. Hence, existing studies that compare object detection models might not apply to GUI element detection. In this study, we evaluate the four most recent successful YOLO models for general object detection on GUI element detection and investigate their accuracy in detecting various GUI elements.

Paper Structure (14 sections, 4 figures, 6 tables)

This paper contains 14 sections, 4 figures, 6 tables.

Introduction
Empirical Study
Research Questions
Experiment Setup
Dataset
Model Selection
Training Process
Results - RQ1 Accuracy Performance
Results - RQ2 Element Detection Difficulty
Results - RQ3 Verification
Related Work
Limitations
Threats to Validity
Conclusion and Future Work

Figures (4)

Figure 1: Distribution of selected labels in the dataset.
Figure 2: Confusion matrices on the test set. (a) YOLOv5R7s (b) YOLOv6R3s (c) YOLOv7 (d) YOLOv8s
Figure 3: A Cropped Sample Image and its labels (a) Cropped Image (b) Cropped Image with Original Labels
Figure 4: Models' results on a test sample. (a) Original Image (b) Ground-Truth (c) YOLOv5R7s (d) YOLOv6R3s (e) YOLOv7 (f) YOLOv8s
