Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5 <canvas> Applications

Finlay Macklon, Cor-Paul Bezemer

TL;DR

The paper addresses the detection of visual bugs in HTML5 <canvas> applications, where a visual bug is a mismatch between the expected and actual <canvas> output and DOM-based testing is inadequate because rendered objects are not represented in the DOM. It proposes using Vision-Language Models, notably GPT-4o, with prompting strategies that combine application context (READMEs, a description of visual bug types) and visual references (bug-free screenshots, assets) to detect bugs without an explicit visual test oracle. The authors build a dataset of 100 screenshots from 20 PixiJS-based applications (80 bug-injected, 20 bug-free), develop an end-to-end testing and bug-injection framework, and show that prompting strategies with rich context can reach high per-application accuracy, up to 100% for some applications, with State bugs the most detectable. Results vary across applications and bug types, but the approach shows promise for reducing the manual testing burden and for enabling regression testing when multiple outputs are sampled per screenshot (pass@$k$). It also lays groundwork for future fine-tuning or targeted preprocessing to improve reliability in practical settings.

Abstract

The HyperText Markup Language 5 (HTML5) <canvas> is useful for creating visual-centric web applications. However, unlike traditional web applications, HTML5 <canvas> applications render objects onto the <canvas> bitmap without representing them in the Document Object Model (DOM). Mismatches between the expected and actual visual output of the <canvas> bitmap are termed visual bugs. Due to the visual-centric nature of <canvas> applications, visual bugs are important to detect because such bugs can render a <canvas> application useless. As we showed in prior work, Asset-Based graphics can provide the ground truth for a visual test oracle. However, many <canvas> applications procedurally generate their graphics. In this paper, we investigate how to detect visual bugs in <canvas> applications that use Procedural graphics as well. In particular, we explore the potential of Vision-Language Models (VLMs) to automatically detect visual bugs. Instead of defining an exact visual test oracle, information about the application's expected functionality (the context) can be provided with the screenshot as input to the VLM. To evaluate this approach, we constructed a dataset containing 80 bug-injected screenshots across four visual bug types (Layout, Rendering, Appearance, and State) plus 20 bug-free screenshots from 20 <canvas> applications. We ran experiments with a state-of-the-art VLM using several combinations of text and image context to describe each application's expected functionality. Our results show that by providing the application README(s), a description of visual bug types, and a bug-free screenshot as context, VLMs can be leveraged to detect visual bugs with up to 100% per-application accuracy.
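
The context described in the abstract (README, bug type descriptions, a bug-free screenshot, and the screenshot under test) could be assembled into a single multimodal prompt roughly as follows. This is a minimal sketch, not the authors' prompts: the function names, the part layout (`type`, `text`, `data_url`), and the wording of each text part are hypothetical, and the bug type descriptions are passed in by the caller rather than taken from the paper.

```python
import base64
from typing import Dict, List


def png_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a base64 data URL."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


def build_prompt_parts(readme: str,
                       bug_types: Dict[str, str],
                       reference_png: bytes,
                       target_png: bytes) -> List[dict]:
    """Order the text and image context ahead of the screenshot under test.

    The dictionary layout is a hypothetical, vendor-neutral format; adapt it
    to whatever request schema the chosen VLM expects.
    """
    taxonomy = "\n".join(f"- {name}: {desc}" for name, desc in bug_types.items())
    return [
        {"type": "text", "text": "Application README:\n" + readme},
        {"type": "text", "text": "Visual bug types:\n" + taxonomy},
        {"type": "text", "text": "Bug-free reference screenshot:"},
        {"type": "image", "data_url": png_to_data_url(reference_png)},
        {"type": "text", "text": "Screenshot under test. Does it contain a visual bug?"},
        {"type": "image", "data_url": png_to_data_url(target_png)},
    ]


# Usage with placeholder inputs (the descriptions are stand-ins, not the paper's).
parts = build_prompt_parts(
    readme="(contents of the application's README)",
    bug_types={name: f"(description of {name} bugs)"
               for name in ("Layout", "Rendering", "Appearance", "State")},
    reference_png=b"reference-bytes",
    target_png=b"target-bytes",
)
```

Keeping the reference image before the screenshot under test lets the final text part ask a single yes/no question about the last image; whether that ordering matters for a given model is an open assumption.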

Paper Structure (45 sections, 2 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Two screenshots from a Breakthrough clone (ourcade/ecs-dependency-injection in Table \ref{tab:subsetcanvasapplications}) coupled with the (correct) visual bug detection results generated with GPT-4o while utilizing prompting strategy AllContextExceptAssets from Table \ref{tab:promptstrategies}.
  • Figure 2: Overview of our dataset construction.
  • Figure 3: Four bug-free screenshots ((a), (c), (e), (g)) and four bug-injected screenshots ((b), (d), (f), (h)) of <canvas> applications collected using our custom framework. Each bug-injected screenshot is paired with a description of the injected visual bug.
  • Figure 4: Distributions of accuracy (%) yielded in experiments using VLMs to detect visual bugs with various prompting strategies. Accuracy is computed over the set of 20 applications. The bug-free accuracies are shown as scatter plots with bug-injected accuracies shown as box plots on a shared set of axes. There are four bug-free accuracy values per prompting strategy. Each box plot represents a distribution of 16 values (four bug types multiplied by four repetitions).
  • Figure 5: Distributions of precision (%) and recall (%) per bug type when using VLMs to detect visual bugs with the prompting strategy AllContextExceptAssets. Precision and recall are computed over the set of 20 applications. Precision and recall are represented on a shared set of axes as scatter plots with point clouds representing the ranges of values. There are four precision values and four recall values per bug type (from four repetitions of experiments).
  • ...and 3 more figures
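
The counts in the Figure 4 caption follow from the dataset layout: each accuracy value is computed over the 20 applications, giving 16 bug-injected values (four bug types times four repetitions) and four bug-free values per prompting strategy. The sketch below illustrates that bookkeeping only. The `predict` callback and the dummy detector are hypothetical stand-ins for a VLM call, and the numbers they produce are not results from the paper.

```python
from itertools import product

BUG_TYPES = ["Layout", "Rendering", "Appearance", "State"]
N_APPS = 20
N_REPETITIONS = 4


def accuracy(verdicts, expected):
    """Percentage of verdicts that equal the expected label."""
    return 100.0 * sum(v == expected for v in verdicts) / len(verdicts)


def summarize(predict):
    """Aggregate accuracies for one prompting strategy.

    predict(app, bug_type, repetition) returns True when a bug is reported;
    bug_type is None for the bug-free screenshot of that application.
    """
    bug_injected = [
        accuracy([predict(app, bug, rep) for app in range(N_APPS)], True)
        for bug, rep in product(BUG_TYPES, range(N_REPETITIONS))
    ]
    bug_free = [
        accuracy([predict(app, None, rep) for app in range(N_APPS)], False)
        for rep in range(N_REPETITIONS)
    ]
    return bug_injected, bug_free


def dummy_predict(app, bug, rep):
    """Placeholder detector: flags injected bugs only in even-numbered apps."""
    return bug is not None and app % 2 == 0


bug_injected_acc, bug_free_acc = summarize(dummy_predict)
total_screenshots = N_APPS * (len(BUG_TYPES) + 1)  # 80 bug-injected + 20 bug-free
```

Each entry of `bug_injected_acc` corresponds to one point in a box plot, and each entry of `bug_free_acc` to one scatter point.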