Synthesizing Precise Protocol Specs from Natural Language for Effective Test Generation

Kuangxiangzi Liu, Dhiman Chakraborty, Alexander Liggesmeyer, Andreas Zeller

TL;DR

This work tackles the challenge of turning natural-language protocol specifications into executable formal artifacts for automated test generation in safety- and security-critical systems. It introduces AutoSpec, a two-stage LLM-driven pipeline that first extracts protocol elements from natural-language RFCs and then synthesizes a formal I/O grammar, which is refined through an execution-guided repair loop against a real protocol implementation. On five internet protocols (SMTP, POP3, IMAP, FTP, and ManageSieve), AutoSpec recovers on average 92.8% of client and 80.2% of server message types and achieves 81.5% message acceptance, while preserving traceability to the source text and producing grammars that can be reused for future testing. The approach removes the LLM from the test-generation step itself, improves reproducibility and auditability, and lays the groundwork for a corpus of natural-language-to-formal-specification mappings that can bootstrap further automation and tooling in protocol testing.

Abstract

Safety- and security-critical systems have to be thoroughly tested against their specifications. The state of practice is to have _natural language_ specifications, from which test cases are derived manually, a process that is slow, error-prone, and difficult to scale. _Formal_ specifications, on the other hand, are well-suited for automated test generation, but are tedious to write and maintain. In this work, we propose a two-stage pipeline that uses large language models (LLMs) to bridge the gap: First, we extract _protocol elements_ from natural-language specifications; second, leveraging a protocol implementation, we synthesize and refine a formal _protocol specification_ from these elements, which we can then use to test further implementations at scale. We consider this two-stage approach superior to end-to-end LLM-based test generation because (1) it produces an _inspectable specification_ that preserves traceability to the original text; (2) the generation of actual test cases _no longer requires an LLM_; (3) the resulting formal specs are _human-readable_ and can be reviewed, version-controlled, and incrementally refined; and (4) over time, we can build a _corpus_ of natural-language-to-formal-specification mappings that can be used to further train and refine LLMs for more automatic translations. Our prototype, AUTOSPEC, demonstrates the feasibility of our approach on five widely used _internet protocols_ (SMTP, POP3, IMAP, FTP, and ManageSieve), taking their natural-language _RFC specifications_ as input and targeting the recent _I/O grammar_ formalism for protocol specification and fuzzing. In its evaluation, AUTOSPEC recovers on average 92.8% of client and 80.2% of server message types, and achieves 81.5% message acceptance across diverse, real-world systems.
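The second stage described above (synthesize a specification, test it against an implementation, repair on errors) can be illustrated with a minimal control-flow sketch. This is not AUTOSPEC's implementation: the function `refine_grammar`, the `RepairResult` type, the `max_rounds` bound, and the three toy callables are all hypothetical names chosen for illustration, and a "grammar" is reduced to a newline-separated list of command names rather than a real I/O grammar.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RepairResult:
    grammar: str        # final grammar text (toy representation)
    rounds: int         # number of repair rounds that were needed
    errors: List[str]   # errors still open when the loop stopped


def refine_grammar(
    elements: List[str],
    synthesize: Callable[[List[str]], str],
    run_tests: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 5,  # assumed bound; the source does not state one
) -> RepairResult:
    """Synthesize a grammar, then repair it until tests pass or the budget ends."""
    grammar = synthesize(elements)
    errors = run_tests(grammar)
    rounds = 0
    while errors and rounds < max_rounds:
        grammar = repair(grammar, errors)
        errors = run_tests(grammar)
        rounds += 1
    return RepairResult(grammar, rounds, errors)


# Toy stand-ins for the LLM and the implementation under test (hypothetical).
def toy_synthesize(elements: List[str]) -> str:
    # Simulate an incomplete first draft by dropping the last element.
    return "\n".join(elements[:-1])


def toy_run_tests(grammar: str) -> List[str]:
    required = ["USER", "PASS"]
    present = grammar.splitlines()
    return [f"missing rule for {c}" for c in required if c not in present]


def toy_repair(grammar: str, errors: List[str]) -> str:
    # Add one rule per reported error; the command name is the last word.
    missing = [e.rsplit(" ", 1)[-1] for e in errors]
    return "\n".join(grammar.splitlines() + missing)


result = refine_grammar(["USER", "PASS"], toy_synthesize, toy_run_tests, toy_repair)
print(result.rounds, result.errors)
```

Passing the synthesizer, test runner, and repairer as callables keeps the loop independent of any particular LLM or fuzzer, which matches the abstract's point that test generation itself does not need an LLM once the grammar exists.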

Paper Structure

This paper contains 45 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: From RFC text to I/O Grammar to test trace. We convert the natural-language POP3 protocol specification (a) into a formal I/O grammar (b). Instantiating this protocol specification yields concrete interactions as test cases (c).
  • Figure 2: How AutoSpec works. From protocol elements automatically extracted from an RFC, an LLM (1) produces an I/O grammar as formal protocol specification. A test generator (2) can then use this I/O grammar to test an actual implementation (3). If errors occur (4), a grammar repair agent (5) suggests specification fixes to the LLM, initiating a repair and refinement cycle. The result is a high-quality I/O grammar that can be used for comprehensive testing.
  • Figure 3: RFC preprocessing pipeline. From the raw RFC text, an LLM (1) extracts three element types: state transitions, message syntax, and message constraints. A merger (2) groups commands by state and links commands by their dependencies. The output (3) comprises state-transition paths and structured RFC contents for I/O grammar synthesis.
  • Figure 4: Prompt for states/commands/transitions extraction.
  • Figure 5: LLM output for POP3 USER/PASS (\ref{fig:prompt2}).
  • ...and 4 more figures
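
The merger step in the Figure 3 caption (grouping commands by state and deriving state-transition paths) can be sketched as follows. This is only an illustration under assumptions: the `Transition` record, the functions `group_by_state` and `enumerate_paths`, the `max_len` bound, and the intermediate state `USER_SENT` (used here to model that PASS depends on USER) are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Transition(NamedTuple):
    state: str       # state in which the command is valid
    command: str     # client command name
    next_state: str  # state after the command succeeds


def group_by_state(transitions: List[Transition]) -> Dict[str, List[str]]:
    """Group command names by the state in which they may be sent."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for t in transitions:
        grouped[t.state].append(t.command)
    return dict(grouped)


def enumerate_paths(
    transitions: List[Transition], start: str, max_len: int = 3
) -> List[List[str]]:
    """Depth-first enumeration of command sequences from `start`, bounded by max_len."""
    outgoing: Dict[str, List[Transition]] = defaultdict(list)
    for t in transitions:
        outgoing[t.state].append(t)

    paths: List[List[str]] = []

    def walk(state: str, prefix: List[str]) -> None:
        if prefix:
            paths.append(prefix)
        if len(prefix) == max_len:
            return
        for t in outgoing[state]:
            walk(t.next_state, prefix + [t.command])

    walk(start, [])
    return paths


# Simplified POP3-like transitions; USER_SENT is an illustrative helper state.
pop3 = [
    Transition("AUTHORIZATION", "USER", "USER_SENT"),
    Transition("USER_SENT", "PASS", "TRANSACTION"),
    Transition("TRANSACTION", "STAT", "TRANSACTION"),
    Transition("TRANSACTION", "QUIT", "UPDATE"),
]

by_state = group_by_state(pop3)
paths = enumerate_paths(pop3, "AUTHORIZATION", max_len=3)
print(by_state)
print(paths)
```

The length bound is needed because self-loops such as STAT would otherwise produce unbounded paths.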