GUI Element Detection Using SOTA YOLO Deep Learning Models
Seyed Shayan Daneshvar, Shaowei Wang
TL;DR
This study benchmarks four recent YOLO models (YOLOv5R7s, YOLOv6R3s, YOLOv7, YOLOv8s) on the VINS dataset to evaluate GUI element detection performance. Each model starts from COCO-pretrained weights and is trained for 300 epochs on 416×416 images; accuracy is measured with $AP@0.5$ and $AP[.5:.05:.95]$. YOLOv5R7s achieves the best $AP@0.5$, while YOLOv7 delivers the best $AP[.5:.05:.95]$. Across element classes, Drawers and Switches are the easiest to detect and Checked Text Views the hardest, and YOLOv8s is sensitive to aspect ratios. The study shows that model rankings on GUI data can diverge from rankings on general datasets such as COCO, and it highlights the risk of relying solely on validation performance for model selection. It also discusses limitations and threats to validity, and suggests future work on integrating GUI element detection with code generation and on exploring differences between mobile and web GUIs. Overall, the work provides a GUI-specific view of state-of-the-art YOLO performance and practical guidance for GUI-focused detection tasks.
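To make the two metrics concrete, the sketch below shows the IoU thresholds behind $AP[.5:.05:.95]$ and how a single predicted box can count as a match at some thresholds but not others. It is a minimal illustration, not the authors' evaluation code: the function names (`iou`, `coco_iou_thresholds`, `thresholds_matched`) and the example boxes are hypothetical, and a full AP computation would also rank detections by confidence and integrate a precision-recall curve per class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95 averaged over in AP[.5:.05:.95]."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """Thresholds at which pred_box would count as a true positive for gt_box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in coco_iou_thresholds() if overlap >= t]


# Hypothetical example: a wide, short GUI element (e.g., a text button)
# predicted with a 10-pixel horizontal offset.
gt = (0, 0, 100, 40)
pred = (10, 0, 110, 40)
matched = thresholds_matched(pred, gt)
print(f"IoU = {iou(pred, gt):.3f}, matched at {len(matched)} of 10 thresholds")
```

In this example the prediction counts as correct under $AP@0.5$ but fails the strictest thresholds, which is one reason a model can rank differently under the two metrics.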
Abstract
Detection of Graphical User Interface (GUI) elements is a crucial task for automatic code generation from images and sketches, GUI testing, and GUI search. Recent studies have leveraged both old-fashioned and modern computer vision (CV) techniques. Old-fashioned methods utilize classic image processing algorithms (e.g., edge detection and contour detection), while modern methods use mature deep learning solutions developed for general object detection tasks. GUI element detection, however, is a domain-specific case of object detection in which objects overlap more often and are located very close to each other. In addition, the number of object classes is considerably lower, yet each image contains more objects than a typical natural image. Hence, existing studies that compare object detection models might not apply to GUI element detection. In this study, we evaluate the four most recent successful YOLO models for general object detection on the task of GUI element detection and investigate their accuracy in detecting various GUI elements.
