Semantic GUI Scene Learning and Video Alignment for Detecting Duplicate Video-based Bug Reports
Yanfu Yan, Nathan Cooper, Oscar Chaparro, Kevin Moran, Denys Poshyvanyk
TL;DR
This paper tackles the challenging problem of identifying duplicate video-based GUI bug reports. It introduces JANUS, a multi-modal detector that combines vision-transformer-derived GUI scene representations, robust OCR-based textual content, and a sequential frame-alignment mechanism to capture reproduction patterns. On a large, realistic benchmark extending prior datasets with real bugs, JANUS achieves state-of-the-art performance (mRR 89.8%, mAP 84.7%), outperforming the previous Tango detector by around 9% on the large majority of duplicate detection tasks and showing strong gains across diverse apps. The work demonstrates that joint visual-textual-sequential modeling yields interpretable, effective duplicate detection, with practical implications for prioritizing bug triage in mobile app development. It also provides a public benchmark for future research and highlights directions for extending GUI understanding to broader platforms and languages.
Abstract
Video-based bug reports are increasingly being used to document bugs for programs centered around a graphical user interface (GUI). However, developing automated techniques to manage video-based reports is challenging as it requires identifying and understanding often nuanced visual patterns that capture key information about a reported bug. In this paper, we aim to overcome these challenges by advancing the bug report management task of duplicate detection for video-based reports. To this end, we introduce a new approach, called JANUS, that adapts the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens, which is key to differentiating between similar screens for accurate duplicate report detection. JANUS also makes use of a video alignment technique capable of adaptive weighting of video frames to account for typical bug manifestation patterns. In a comprehensive evaluation on a benchmark containing 7,290 duplicate detection tasks derived from 270 video-based bug reports from 90 Android app bugs, the best configuration of our approach achieves an overall mRR/mAP of 89.8%/84.7%, and for the large majority of duplicate detection tasks, outperforms prior work by around 9% to a statistically significant degree. Finally, we qualitatively illustrate how the scene-learning capabilities provided by JANUS benefit its performance.
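The mRR and mAP figures reported above are standard ranking metrics. As a minimal sketch of how such scores are typically computed, assume each duplicate detection task yields a ranked list of candidate reports, represented here as booleans marking which candidates are true duplicates. The function names and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Dict, Sequence


def reciprocal_rank(ranked_is_duplicate: Sequence[bool]) -> float:
    """Return 1 / rank of the first true duplicate, or 0.0 if there is none."""
    for rank, is_dup in enumerate(ranked_is_duplicate, start=1):
        if is_dup:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_is_duplicate: Sequence[bool]) -> float:
    """Average the precision values measured at the rank of each true duplicate."""
    hits = 0
    precision_sum = 0.0
    for rank, is_dup in enumerate(ranked_is_duplicate, start=1):
        if is_dup:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(tasks: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Compute mRR and mAP as the means of the per-task scores."""
    if not tasks:
        return {"mRR": 0.0, "mAP": 0.0}
    n = len(tasks)
    return {
        "mRR": sum(reciprocal_rank(t) for t in tasks) / n,
        "mAP": sum(average_precision(t) for t in tasks) / n,
    }


if __name__ == "__main__":
    # Two hypothetical tasks: True marks a candidate that duplicates the query.
    toy_tasks = [
        [True, False, True],   # duplicates ranked 1st and 3rd
        [False, True, False],  # single duplicate ranked 2nd
    ]
    print(evaluate(toy_tasks))
```

In this toy input, the first task has a reciprocal rank of 1 and the second of 1/2, so mRR is 0.75; mAP additionally rewards placing every duplicate, not only the first, near the top of the ranking.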
