Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat
Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti
TL;DR
The paper tackles two obstacles to industrial UI automation testing: high LLM cost and the models' lack of app-specific knowledge. It introduces CAT, a cost-aware pipeline that pairs Retrieval Augmented Generation (RAG) with machine-learning-based UI element mapping, using an LLM as a complementary optimizer. CAT works in two phases. First, it decomposes a high-level task description into a sequence of executable actions, with retrieved examples of industrial app usage serving as few-shot context. Second, it executes each action by mapping it to the target element on the dynamic UI screen, falling back on LLM reasoning to refine the mapping. Evaluated on the WeChat testing dataset (tens of thousands of tasks), CAT automates 90% of tasks at a cost of $0.34 per test, outperforming state-of-the-art baselines. After integration into the real-world WeChat testing platform, it detected 141 bugs. The work shows that a hybrid of machine learning and LLMs can make industrial-level UI automation scalable and cost-effective for large apps such as WeChat.
Abstract
UI automation tests play a crucial role in ensuring the quality of mobile applications. Despite the growing popularity of machine learning techniques to generate these tests, they still face several challenges, such as the mismatch of UI elements. The recent advances in Large Language Models (LLMs) have addressed these issues by leveraging their semantic understanding capabilities. However, a significant gap remains in applying these models to industrial-level app testing, particularly in terms of cost optimization and knowledge limitations. To address this, we introduce CAT to create cost-effective UI automation tests for industry apps by combining machine learning and LLMs with best practices. Given the task description, CAT employs Retrieval Augmented Generation (RAG) to source examples of industrial app usage as the few-shot learning context, assisting LLMs in generating the specific sequence of actions. CAT then employs machine learning techniques, with LLMs serving as a complementary optimizer, to map the target element on the UI screen. Our evaluations on the WeChat testing dataset demonstrate CAT's performance and cost-effectiveness, achieving 90% UI automation at a cost of $0.34, outperforming the state-of-the-art. We have also integrated our approach into the real-world WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and enhancing the developers' testing process.
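The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the retriever, the matcher, or the prompts, so token-overlap (Jaccard) similarity stands in for both the retrieval model and the machine-learning element matcher, and every name here (`retrieve_examples`, `build_prompt`, `map_element`, `llm_pick`, the 0.5 threshold) is hypothetical. The point is only the cost-aware shape: retrieve similar past usages as few-shot context, then try a cheap matcher first and call the LLM only when the matcher is not confident.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, List[str]]  # (task description, recorded action sequence)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for a learned model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_examples(task: str, corpus: List[Example], k: int = 2) -> List[Example]:
    """Phase 1a: return the k past usages whose descriptions best match the task."""
    return sorted(corpus, key=lambda ex: jaccard(task, ex[0]), reverse=True)[:k]


def build_prompt(task: str, examples: List[Example]) -> str:
    """Phase 1b: assemble the retrieved usages into a few-shot prompt for the LLM."""
    shots = "\n\n".join(
        f"Task: {desc}\nActions: {'; '.join(actions)}" for desc, actions in examples
    )
    return f"{shots}\n\nTask: {task}\nActions:"


def map_element(
    target: str,
    screen: List[str],
    llm_pick: Callable[[str, List[str]], Optional[str]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Tuple[Optional[str], str]:
    """Phase 2: cheap matcher first; fall back to the LLM only on low confidence."""
    if not screen:
        return None, "none"
    best = max(screen, key=lambda el: jaccard(target, el))
    if jaccard(target, best) >= threshold:
        return best, "matcher"
    return llm_pick(target, screen), "llm"


if __name__ == "__main__":
    corpus = [
        ("send a message to a friend", ["tap Chats", "tap contact", "tap Send"]),
        ("post a photo to moments", ["tap Discover", "tap Moments", "tap camera"]),
        ("transfer money to a contact", ["tap contact", "tap Transfer"]),
    ]
    task = "send a photo message to a friend"
    print(build_prompt(task, retrieve_examples(task, corpus)))
    # A placeholder LLM that just picks the first element on the screen.
    print(map_element("Send", ["Send", "Cancel"], lambda t, s: s[0]))
```

Under these assumptions, the expensive LLM call in `map_element` is reached only when no on-screen label clears the threshold, which is where the cost saving in such a design would come from.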
