Automating a Complete Software Test Process Using LLMs: An Automotive Case Study
Shuai Wang, Yinan Yu, Robert Feldt, Dhasarathy Parthasarathy
TL;DR
This work automates a largely manual automotive API testing workflow by decomposing the process into discrete steps and using Large Language Models (LLMs) to automate individual steps. The authors introduce SPAPI-Tester, a four-stage pipeline covering documentation understanding, information matching, test-case generation, and execution/reporting, augmented with DSPy-based prompt optimization and structured outputs. Evaluated on 41 truck APIs and 193 real-world APIs, SPAPI-Tester achieves API pass rates of up to about 98%, strong test-case precision, and effective failure detection, with end-to-end automation taking roughly 11 seconds per API. The results indicate that LLM-driven automation can take over repetitive testing tasks while preserving the structure of the existing process, suggesting broad applicability to web-server API testing and potential for lifecycle-level automation in automotive software engineering.
Abstract
Vehicle API testing verifies whether the interactions between a vehicle's internal systems and external applications meet expectations, ensuring that users can access and control various vehicle functions and data. However, this task is inherently complex, requiring the alignment and coordination of API systems, communication protocols, and even vehicle simulation systems to develop valid test cases. In practical industrial scenarios, inconsistencies, ambiguities, and interdependencies across various documents and system specifications pose significant challenges. This paper presents a system designed for the automated testing of in-vehicle APIs. By clearly defining and segmenting the testing process, we enable Large Language Models (LLMs) to focus on specific tasks, ensuring a stable and controlled testing workflow. Experiments conducted on over 100 APIs demonstrate that our system effectively automates vehicle API testing. The results also confirm that LLMs can efficiently handle mundane tasks requiring human judgment, making them suitable for complete automation in similar industrial contexts.
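To make the idea of segmenting the testing process concrete, the following is a minimal Python sketch of a four-stage flow (documentation understanding, information matching, test-case generation, execution/reporting). It is not the authors' implementation: every name here (`understand_docs`, `match_information`, `generate_test_cases`, `execute_and_report`, `run_pipeline`, the signal table, and the `name=value` prompt format) is a hypothetical assumption chosen for illustration, and the LLM is replaced by a plain callable so the example runs without any model or the DSPy library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM: takes a prompt string, returns text.
LLM = Callable[[str], str]


@dataclass
class TestCase:
    api: str
    request: Dict[str, str]   # signal name -> value sent to the system
    expected: Dict[str, str]  # API attribute -> value expected back


@dataclass
class Report:
    api: str
    passed: int
    failed: int


def understand_docs(llm: LLM, api_doc: str) -> Dict[str, str]:
    """Stage 1: ask the LLM for structured 'name=value' lines and parse them."""
    raw = llm(f"List each attribute as name=value, one per line:\n{api_doc}")
    spec: Dict[str, str] = {}
    for line in raw.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            spec[name.strip()] = value.strip()
    return spec


def match_information(spec: Dict[str, str],
                      signal_table: Dict[str, str]) -> Dict[str, str]:
    """Stage 2: map documented attributes to (assumed) vehicle signal names."""
    return {attr: signal_table[attr] for attr in spec if attr in signal_table}


def generate_test_cases(api: str, spec: Dict[str, str],
                        mapping: Dict[str, str]) -> List[TestCase]:
    """Stage 3: build one test case per matched attribute."""
    return [TestCase(api, {mapping[a]: spec[a]}, {a: spec[a]}) for a in mapping]


def execute_and_report(api: str, cases: List[TestCase],
                       run: Callable[[TestCase], Dict[str, str]]) -> Report:
    """Stage 4: run each case against the system and count outcomes."""
    outcomes = [run(case) == case.expected for case in cases]
    passed = sum(outcomes)
    return Report(api, passed, len(outcomes) - passed)


def run_pipeline(api: str, api_doc: str, signal_table: Dict[str, str],
                 llm: LLM, run: Callable[[TestCase], Dict[str, str]]) -> Report:
    """Chain the four stages; each stage has one narrow responsibility."""
    spec = understand_docs(llm, api_doc)
    mapping = match_information(spec, signal_table)
    cases = generate_test_cases(api, spec, mapping)
    return execute_and_report(api, cases, run)


# Fakes used only for demonstration (assumptions, not real components).
def fake_llm(prompt: str) -> str:
    return "speed=80\nmode=eco"


def fake_system(case: TestCase) -> Dict[str, str]:
    # Echoes speed correctly but always reports the wrong mode.
    if "VehSpd" in case.request:
        return {"speed": case.request["VehSpd"]}
    return {"mode": "sport"}


report = run_pipeline(
    api="/vehicle/status",
    api_doc="(documentation text would go here)",
    signal_table={"speed": "VehSpd", "mode": "DrvMode"},
    llm=fake_llm,
    run=fake_system,
)
print(report)
```

Keeping each stage behind its own function means the model only ever sees a narrow, well-defined task, and any stage can be checked or swapped out in isolation; in this sketch only the first stage calls the model at all, which is a simplification.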
