MASTEST: A LLM-Based Multi-Agent System For RESTful API Tests

Xiaoke Han, Hong Zhu

TL;DR

MASTEST is an LLM-based multi-agent system for end-to-end RESTful API testing. It combines LLM-based agents with programmed components to cover the full workflow: generating test scenarios and executable Pytest scripts from OpenAPI specifications, running the scripts against services, and analyzing the responses, with human-in-the-loop quality control. By decomposing the testing process into granular tasks and providing GUI-based review interfaces, it mitigates LLM hallucinations and error propagation while preserving the benefits of automation. Across five public APIs and two LLMs, MASTEST achieves high unit and system test scenario coverage, 100% script syntax correctness, and competitive data-type accuracy; DeepSeek excels in data-type correctness and status-code coverage, while GPT-4o performs best on API operation coverage. The results indicate that the approach is feasible for AI-assisted API testing and offer insights for tool development, human-in-the-loop QA, and future extension to more APIs and more realistic test data generation.

Abstract

Testing RESTful APIs is increasingly important in the quality assurance of cloud-native applications. Recent advances in machine learning (ML) techniques have demonstrated that various testing activities can be performed automatically by large language models (LLMs) with reasonable accuracy. This paper develops a multi-agent system called MASTEST that combines LLM-based and programmed agents to form a complete tool chain covering the whole API testing workflow: generating unit and system test scenarios from an API specification in the OpenAPI Swagger format, generating Pytest test scripts, executing the test scripts to interact with web services, and analysing web service response messages to determine test correctness and calculate test coverage. The system also supports the incorporation of human testers in reviewing and correcting LLM-generated test artefacts to ensure the quality of testing activities. The MASTEST system is evaluated on two LLMs, GPT-4o and DeepSeek V3.1 Reasoner, with five public APIs. The performance of the LLMs on various testing activities is measured by a wide range of metrics: unit and system test scenario coverage and API operation coverage for the quality of generated test scenarios; data type correctness, status code coverage and script syntax correctness for the quality of LLM-generated test scripts; and bug detection ability and usability of LLM-generated test scenarios and scripts. Experimental results demonstrate that both DeepSeek and GPT-4o achieve high overall performance. DeepSeek excels in data type correctness and status code detection, while GPT-4o performs best in API operation coverage. For both models, the LLM-generated test scripts maintained 100% syntax correctness and required only minimal manual edits for semantic correctness. These findings indicate the effectiveness and feasibility of MASTEST.
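
To make two of the coverage metrics named in the abstract concrete, the following is a minimal Python sketch of how API operation coverage and status code coverage could be computed from an OpenAPI specification and a log of executed requests. It is an illustration under stated assumptions, not MASTEST's implementation: the function names, the `SPEC` and `OBSERVED` sample data, and the plain-ratio definitions of both metrics are hypothetical and may differ from the definitions given in the paper.

```python
# Hypothetical sketch: plain-ratio coverage metrics over an OpenAPI "paths" object.
# None of these names come from MASTEST; the metric definitions are assumptions.

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_operations(spec):
    """Map (METHOD, path) to the set of status codes documented for that operation."""
    operations = {}
    for path, path_item in spec.get("paths", {}).items():
        for key, operation in path_item.items():
            # Path items may also hold non-operation keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                codes = {str(code) for code in operation.get("responses", {})}
                operations[(key.upper(), path)] = codes
    return operations


def operation_coverage(spec, observed):
    """Fraction of documented operations exercised by at least one request."""
    operations = list_operations(spec)
    if not operations:
        return 0.0
    exercised = {(method.upper(), path) for method, path, _ in observed}
    return len(exercised & set(operations)) / len(operations)


def status_code_coverage(spec, observed):
    """Fraction of documented (operation, status code) pairs actually observed."""
    operations = list_operations(spec)
    documented = {
        (method, path, code)
        for (method, path), codes in operations.items()
        for code in codes
    }
    if not documented:
        return 0.0
    seen = {(method.upper(), path, str(code)) for method, path, code in observed}
    return len(seen & documented) / len(documented)


# Sample data (invented for illustration): three operations, five documented codes.
SPEC = {
    "paths": {
        "/pets": {
            "get": {"responses": {"200": {}}},
            "post": {"responses": {"201": {}, "400": {}}},
        },
        "/pets/{id}": {
            "get": {"responses": {"200": {}, "404": {}}},
        },
    }
}

# Each entry is (method, path template, status code returned by the service).
OBSERVED = [("GET", "/pets", 200), ("POST", "/pets", 400)]

if __name__ == "__main__":
    print(operation_coverage(SPEC, OBSERVED))
    print(status_code_coverage(SPEC, OBSERVED))
```

With the sample data, two of the three documented operations and two of the five documented status codes are exercised. Responses with undocumented status codes are ignored by this sketch; a real tool would likely flag them as potential bugs instead.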

Paper Structure

This paper contains 31 sections, 9 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: Workflow and MASTEST Architecture.
  • Figure 2: Prompt Template for Generating Test Scripts.
  • Figure 3: Interface of API Operation Inspector.
  • Figure 4: System Structure.
  • Figure 5: Number of Bugs Detected by GPT-4o and DeepSeek.
  • ...and 10 more figures