Model-Enhanced LLM-Driven VUI Testing of VPA Apps

Suwan Li, Lei Bu, Guangdong Bai, Fuman Xie, Kai Chen, Chang Yue

TL;DR

VPA apps rely on a voice-only interface, which makes traditional GUI-based testing difficult, and their open ecosystem raises security, privacy, and quality concerns. Elevate is a model-enhanced, LLM-driven VUI testing framework that constructs a behavior model on the fly in three phases (state extraction, input event generation, and state space exploration), using prompts and feedback to compensate for the semantic information lost in model-based testing. Across 4,000 Alexa skills, Elevate achieves higher semantic state coverage and greater efficiency than the state-of-the-art Vitas and GPT-4 chatbot baselines, and it carries over to other LLMs (e.g., Llama2-70b-chat), with an estimated 15–30% coverage advantage across various categories. The results suggest that combining LLM reasoning and chain-of-thought prompting with dynamic model-based testing (MBT) can test VPA apps robustly at scale, with practical benefits for the reliability and safety of VPA ecosystems.

Abstract

The flourishing ecosystem centered around voice personal assistants (VPA), such as Amazon Alexa, has led to a boom in VPA apps. The largest app market, the Amazon skills store, hosts over 200,000 apps. Despite the apps' popularity, the open nature of app release and the easy accessibility of apps raise significant concerns regarding security, privacy, and quality. Consequently, various testing approaches have been proposed to systematically examine VPA app behaviors. To cope with the lack of a visible user interface in VPA apps, testing follows one of two strategies: chatbot-style testing or model-based testing. The former often lacks effective guidance for expanding its search space, while the latter falls short in interpreting the semantics of conversations, so it cannot construct precise and comprehensive behavior models for apps. In this work, we introduce Elevate, a model-enhanced, large language model (LLM)-driven voice user interface (VUI) testing framework. Elevate leverages LLMs' strong natural language processing capability to compensate for the semantic information lost during model-based VUI testing. It prompts LLMs to extract states from VPA apps' outputs and to generate context-related inputs. While interacting with the app automatically, it incrementally constructs the behavior model, which in turn helps the LLM generate inputs that are highly likely to discover new states. Elevate bridges the LLM and the behavior model with techniques such as encoding the behavior model into prompts and selecting LLM-generated inputs based on context relevance. Elevate is benchmarked on 4,000 real-world Alexa skills against the state-of-the-art tester Vitas. It achieves 15% higher state space coverage than Vitas on all types of apps and is significantly more efficient.
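The loop described in the abstract (extract a state from each app output, generate context-related inputs, and grow the behavior model while exploring) can be illustrated with a minimal sketch. This is not the authors' implementation: `ToySkill`, `extract_state`, `generate_inputs`, `replay`, and `explore` are hypothetical names, the toy skill stands in for a real Alexa skill, and the two LLM-backed steps are replaced by deterministic stubs so the example runs offline.

```python
import re
from collections import deque


class ToySkill:
    """Hypothetical stand-in for a VPA app: a tiny scripted dialogue."""

    TRANSITIONS = {
        ("home", "quiz"): "question",
        ("home", "facts"): "fact",
        ("question", "yes"): "home",
        ("question", "no"): "home",
        ("fact", "more"): "fact",
        ("fact", "stop"): "end",
    }
    PROMPTS = {
        "home": "Welcome! Say quiz or facts.",
        "question": "Is the sky blue? Say yes or no.",
        "fact": "Cats sleep a lot. Say more or stop.",
        "end": "Goodbye.",
    }

    def launch(self) -> str:
        self.node = "home"
        return self.PROMPTS[self.node]

    def say(self, utterance: str) -> str:
        # Unknown inputs end the session in this toy app.
        self.node = self.TRANSITIONS.get((self.node, utterance), "end")
        return self.PROMPTS[self.node]


def extract_state(output: str) -> str:
    # Stub for the state-extraction step; a real tester would prompt an LLM here.
    return output.strip().lower()


def generate_inputs(output: str) -> list[str]:
    # Stub for the input-generation step: read the options out of "Say X or Y."
    match = re.search(r"say (\w+) or (\w+)", output, flags=re.IGNORECASE)
    return [match.group(1).lower(), match.group(2).lower()] if match else []


def replay(skill: ToySkill, path: list[str]) -> str:
    # Restart the app and replay a path of utterances to reach a state.
    output = skill.launch()
    for utterance in path:
        output = skill.say(utterance)
    return output


def explore(skill: ToySkill, max_sessions: int = 100) -> dict[str, dict[str, str]]:
    """Breadth-first exploration that builds the model: state -> {input: next state}."""
    model: dict[str, dict[str, str]] = {}
    queue: deque[list[str]] = deque([[]])
    sessions = 0
    while queue and sessions < max_sessions:
        path = queue.popleft()
        output = replay(skill, path)
        sessions += 1
        state = extract_state(output)
        if state in model:
            continue  # this state was already expanded
        model[state] = {}
        for utterance in generate_inputs(output):
            next_state = extract_state(replay(skill, path + [utterance]))
            sessions += 1
            model[state][utterance] = next_state
            if next_state not in model:
                queue.append(path + [utterance])  # only chase unseen states
    return model


if __name__ == "__main__":
    for state, edges in explore(ToySkill()).items():
        print(state, "->", edges)
```

The sketch uses plain breadth-first search over unseen states as a simple placeholder for the exploration strategy. The abstract describes something richer, where the behavior model is encoded into the prompt and LLM-generated inputs are selected by context relevance; both are omitted here.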

Paper Structure (20 sections, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Lack of semantic information impacts the testing efficiency.
  • Figure 2: LLMs can generate redundant and repeated results if prompts are not carefully designed.
  • Figure 3: The framework of Elevate.
  • Figure 4: The workflow of the state filter.
  • Figure 5: The workflow of the input checker.
  • ...and 6 more figures