Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5 <canvas> Applications
Finlay Macklon, Cor-Paul Bezemer
TL;DR
The paper tackles the problem of detecting visual bugs in HTML5 <canvas> applications, where bugs reflect mismatches between expected and actual canvas output and the DOM-based testing paradigm is inadequate. It proposes using Vision-Language Models, notably GPT-4o, with prompting strategies that blend application context (readmes, bug taxonomies) and visual references (bug-free screenshots, assets) to detect bugs without explicit visual test oracles. The authors create a dataset of 100 screenshots from 20 PixiJS-based apps (80 bug-injected, 20 bug-free), develop an end-to-end testing and bug-injection framework, and demonstrate that prompting strategies providing rich context can yield high per-application accuracy, up to 100% for some apps, with state bugs being most detectable. While results vary across applications and bug types, the approach shows promise to reduce manual testing burden and to enable regression testing when combined with multiple outputs per screenshot (pass@$k$), laying groundwork for future fine-tuning or targeted preprocessing to further improve reliability in practical settings.
Abstract
The HyperText Markup Language 5 (HTML5) <canvas> is useful for creating visual-centric web applications. However, unlike traditional web applications, HTML5 <canvas> applications render objects onto the <canvas> bitmap without representing them in the Document Object Model (DOM). Mismatches between the expected and actual visual output of the <canvas> bitmap are termed visual bugs. Due to the visual-centric nature of <canvas> applications, visual bugs are important to detect because such bugs can render a <canvas> application useless. As we showed in prior work, Asset-Based graphics can provide the ground truth for a visual test oracle. However, many <canvas> applications procedurally generate their graphics. In this paper, we investigate how to detect visual bugs in <canvas> applications that use Procedural graphics as well. In particular, we explore the potential of Vision-Language Models (VLMs) to automatically detect visual bugs. Instead of defining an exact visual test oracle, information about the application's expected functionality (the context) can be provided with the screenshot as input to the VLM. To evaluate this approach, we constructed a dataset containing 80 bug-injected screenshots across four visual bug types (Layout, Rendering, Appearance, and State) plus 20 bug-free screenshots from 20 <canvas> applications. We ran experiments with a state-of-the-art VLM using several combinations of text and image context to describe each application's expected functionality. Our results show that by providing the application README(s), a description of visual bug types, and a bug-free screenshot as context, VLMs can be leveraged to detect visual bugs with up to 100% per-application accuracy.
