VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps

Biniam Fisseha Demissie, Yan Naing Tun, Lwin Khin Shar, Mariano Ceccato

TL;DR

VLM-Fuzz targets the persistent problem of low code coverage in Android UI testing by combining a vision language model (VLM) with a heuristic, recursive depth-first search (DFS) exploration informed by static analysis of the Android Manifest and by the runtime UI hierarchy. The approach derives component budgets from UI element counts, launches components explicitly, and applies VLM-guided input sequences (with a non-vision fallback) to get through complex screens and pop-ups, while recording transitions so that it can backtrack. On a benchmark of 59 apps, VLM-Fuzz outperforms the best baseline by 9.0%, 3.7%, and 2.1% in class, method, and line coverage, respectively; on 80 recent Google Play apps it detected 208 unique crashes in 24 apps. An ablation study supports the value of the VLM component for complex GUIs, and the approach is reported to be cost-aware and scalable.
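To make the exploration idea concrete, here is a minimal sketch of a recursive DFS with a per-screen action budget and a transition log. It is not the authors' algorithm: the toy app model (`ToyApp`), the function `explore`, the parameter `budget_per_widget`, and the rule that the budget is proportional to the number of available actions are all assumptions made for illustration, and no VLM is involved.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy app model: screen name -> {action label -> resulting screen}.
ToyApp = Dict[str, Dict[str, str]]
Transition = Tuple[str, str, str]  # (source screen, action, target screen)


def explore(app: ToyApp, start: str, budget_per_widget: int = 1) -> Tuple[Set[str], List[Transition]]:
    """Recursive DFS over a toy UI graph with a per-screen action budget."""
    visited: Set[str] = set()
    transitions: List[Transition] = []

    def visit(screen: str) -> None:
        visited.add(screen)
        actions = app.get(screen, {})
        # Assumed budget rule: proportional to the number of interactive widgets.
        budget = budget_per_widget * len(actions)
        for action, target in sorted(actions.items()):
            if budget <= 0:
                break
            budget -= 1
            # Record every transition so a path back to `screen` is known.
            transitions.append((screen, action, target))
            if target not in visited:
                visit(target)
                # Returning from the recursive call models backtracking to `screen`.

    visit(start)
    return visited, transitions


toy_app: ToyApp = {
    "Main": {"open_settings": "Settings", "open_login": "Login"},
    "Settings": {"back": "Main"},
    "Login": {"submit": "Home", "back": "Main"},
    "Home": {},
}
visited, transitions = explore(toy_app, "Main")
print(sorted(visited))
print(transitions)
```

In a real fuzzer the recursion would drive a device and backtracking would replay or undo recorded transitions; the sketch only shows how a budget bounds the work spent on each screen.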

Abstract

Testing Android apps effectively requires a systematic exploration of the app's possible states by simulating user interactions and system events. While existing approaches have proposed several fuzzing techniques to generate various text inputs and trigger user and system events for UI state exploration, achieving high code coverage remains a significant challenge in Android app testing. The main challenges are (1) reasoning about the complex and dynamic layout of UI screens; (2) generating the inputs/events required to deal with certain widgets like pop-ups; and (3) coordinating current test inputs with previous inputs to avoid getting stuck in the same UI screen without improving test coverage. To address these problems, we propose a novel, automated fuzzing approach called VLM-Fuzz for effective UI testing of Android apps. We present a novel heuristic-based depth-first search (DFS) exploration algorithm, assisted by a vision language model (VLM), to effectively explore the UI states of the app. We use static analysis of the Android Manifest file and the runtime UI hierarchy XML to extract the list of components, intent-filters and interactive UI widgets. The VLM is used to reason about complex UI layouts and widgets on an on-demand basis. Based on the inputs from static analysis, the VLM, and the current UI state, we apply heuristics to deal with the above-mentioned challenges. We evaluated VLM-Fuzz on a benchmark of 59 apps obtained from recent work and compared it against two state-of-the-art approaches: APE and DeepGUI. VLM-Fuzz outperforms the best baseline by 9.0%, 3.7%, and 2.1% in terms of class coverage, method coverage, and line coverage, respectively. We also ran VLM-Fuzz on 80 recent Google Play apps (i.e., updated in 2024). VLM-Fuzz detected 208 unique crashes in 24 apps, which have been reported to the respective developers.
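The extraction step mentioned in the abstract can be illustrated with a short sketch. It assumes the manifest has already been decoded to plain-text XML (the manifest inside an APK is stored in a binary format) and that the UI hierarchy follows the `<hierarchy>`/`<node>` layout produced by a `uiautomator dump`. The helper names `list_components` and `list_interactive_widgets`, the sample XML, and the choice of attributes that count as "interactive" are illustrative assumptions, not the paper's implementation.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")
# Assumption: a widget is "interactive" if any of these flags is "true".
INTERACTIVE_FLAGS = ("clickable", "long-clickable", "scrollable", "checkable")


def list_components(manifest_xml: str) -> list:
    """Return declared components with the actions of their intent-filters."""
    app = ET.fromstring(manifest_xml).find("application")
    components = []
    if app is None:
        return components
    for tag in COMPONENT_TAGS:
        for elem in app.findall(tag):
            actions = [
                action.get(ANDROID_NS + "name")
                for intent_filter in elem.findall("intent-filter")
                for action in intent_filter.findall("action")
            ]
            components.append(
                {"type": tag, "name": elem.get(ANDROID_NS + "name"), "actions": actions}
            )
    return components


def list_interactive_widgets(hierarchy_xml: str) -> list:
    """Return attribute dicts of nodes that accept some user interaction."""
    root = ET.fromstring(hierarchy_xml)
    return [
        dict(node.attrib)
        for node in root.iter("node")
        if any(node.get(flag) == "true" for flag in INTERACTIVE_FLAGS)
    ]


# Made-up sample inputs for illustration only.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <application>
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <service android:name=".SyncService"/>
  </application>
</manifest>"""

SAMPLE_HIERARCHY = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.EditText" text="" clickable="true" bounds="[40,300][1040,420]"/>
    <node class="android.widget.TextView" text="Welcome" clickable="false" bounds="[40,100][1040,200]"/>
    <node class="android.widget.Button" text="Login" clickable="true" bounds="[40,500][1040,620]"/>
  </node>
</hierarchy>"""

components = list_components(SAMPLE_MANIFEST)
widgets = list_interactive_widgets(SAMPLE_HIERARCHY)
print(components)
print([w["class"] for w in widgets])
```

The component list is the kind of input that would allow explicit component launches, and the widget count is the kind of quantity a budget could be derived from; how VLM-Fuzz actually combines them is described in the paper itself.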

Paper Structure

This paper contains 8 sections, 3 figures, 1 table, and 2 algorithms.

Figures (3)

  • Figure 1: An example of a UI that is challenging for typical UI fuzzers (shown with VLM-Fuzz generated input).
  • Figure 2: Examples of complex UIs showing (a) a UI overlaid on top of another UI; (b) a UI where complex, valid input is required; (c) a UI where a particular sequence of events is required.
  • Figure 3: Approach overview.