SpeCrawler: Generating OpenAPI Specifications from API Documentation Using Large Language Models

Koren Lazar; Matan Vetzler; Guy Uziel; David Boaz; Esther Goldbraich; David Amid; Ateret Anaby-Tavor

TL;DR

SpeCrawler is a system that uses large language models (LLMs) to generate OpenAPI Specifications from diverse API documentation through a multi-stage pipeline, streamlining integration within API orchestration systems and easing the incorporation of tools into LLMs.

Abstract

APIs are now in widespread use. However, using them at scale is challenging because online API documentation varies widely in structure. This underscores the need for automatic tools that facilitate API consumption. A viable approach is to convert documentation into an API Specification format. Previous attempts used rule-based methods, which had difficulty generalizing across diverse documentation. In this paper, we introduce SpeCrawler, a comprehensive system that uses large language models (LLMs) to generate OpenAPI Specifications from diverse API documentation through a carefully crafted pipeline. By creating a standardized format for numerous APIs, SpeCrawler helps streamline integration processes within API orchestration systems and facilitates the incorporation of tools into LLMs. The paper describes SpeCrawler's methodology and demonstrates its efficacy with empirical evidence and case studies.

Paper Structure (16 sections, 5 figures, 4 tables)

Introduction
OpenAPIs and Documentation Websites
SpeCrawler
Scraping
Base OAS Generation
Enrichment
Experiments
Base OAS Generation
Enrichment Generation
End-to-End Testing
Comparing Against LLM-Based Solutions
Related Work
Conclusions
Appendix
Find Minimal Ancestor Algorithm
...and 1 more section

Figures (5)

Figure 1: Illustration of an OpenAPI Specification schema structure, highlighting the hierarchical arrangement of components like paths, parameters, responses, and other elements that define RESTful APIs. Typically, OpenAPI Specification files are stored in YAML or JSON formats.
Figure 2: SpeCrawler Architecture: This diagram shows the steps involved in transforming REST API documentation into an accurate OpenAPI Specification (OAS). The process begins by extracting pairs of request and response elements from the API documentation webpage, which serve as the foundation for creating a skeletal OAS, as described in detail in the Base OAS Generation section. Then, the descriptive section of the documentation is used to gather comprehensive details about the API, its request and response elements, and their parameters, as described in the Enrichment section. Finally, the outputs of both processes are merged into a comprehensive OAS. Both procedures rely on large language models for generation.
Figure 3: API documentation webpage - This diagram illustrates a typical structure found in online API documentation webpages, including its key components. On the left side is the reference-based documentation, which primarily comprises descriptive text explaining the API, its request and response elements, parameters, and additional metadata. On the right side, the example-style documentation section provides practical demonstrations of API interaction, including common request examples and sample responses that users can expect. The API's HTTP method and URL are commonly featured on either side.
Figure 4: Labeled data - This figure provides a visual representation of input-output pairs from the enrichment generation stage, sourced from PayPal Developer and Amplitude APIs. The top section displays examples of a request element, while the bottom section displays a response element example. On the left side of the figure, you'll find input sources, which consist of scraped raw HTML scopes from API documentation websites. On the right side, the results of the enrichment generation process are presented. For request elements, the results are formatted as a TSV table, while for response elements, they are showcased as a response OpenAPI schema nested object.
Figure 5: An example of a prompt used for generating a TSV table for the task of request enrichment with a single in-context example.
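
To make the captions above more concrete, the following is a minimal sketch of the two artifacts they describe: a skeletal OAS for a single operation (Figures 1 and 2) and a TSV table of request parameters merged into it (Figures 4 and 5). It is an illustration under stated assumptions, not SpeCrawler's implementation: no LLM is called, the function names `base_oas` and `enrich_with_tsv` are hypothetical, and the endpoint, the TSV column names, and the sample rows are invented for the example.

```python
import csv
import io
import json


def base_oas(method, url_path, title="Example API"):
    """Build a skeletal OpenAPI 3.0 document for one operation (hypothetical helper)."""
    return {
        "openapi": "3.0.0",
        "info": {"title": title, "version": "1.0.0"},
        "paths": {
            url_path: {
                method.lower(): {
                    "parameters": [],
                    "responses": {"200": {"description": "Successful response"}},
                }
            }
        },
    }


def enrich_with_tsv(oas, method, url_path, tsv_text):
    """Merge parameter rows from a TSV table into the operation (hypothetical helper)."""
    operation = oas["paths"][url_path][method.lower()]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        operation["parameters"].append({
            "name": row["name"],
            "in": row["in"],
            "required": row["required"].strip().lower() == "true",
            "description": row["description"],
            "schema": {"type": row["type"]},
        })
    return oas


# Assumed TSV layout; the column set used in the paper may differ.
TSV = (
    "name\tin\ttype\trequired\tdescription\n"
    "user_id\tpath\tstring\ttrue\tIdentifier of the user\n"
    "limit\tquery\tinteger\tfalse\tMaximum number of items to return\n"
)

# Invented endpoint, used only to exercise the two helpers.
PATH = "/users/{user_id}/orders"
spec = enrich_with_tsv(base_oas("GET", PATH), "GET", PATH, TSV)
print(json.dumps(spec, indent=2))
```

In this sketch the TSV stands in for the output of the enrichment stage; a flat table is easy to validate row by row before it is merged into the nested OAS structure.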
