Inferring Input Grammars from Code with Symbolic Parsing

Leon Bettscheider, Andreas Zeller

TL;DR

This paper introduces STALAGMITE, the first static approach to mining input grammars directly from the code of recursive-descent parsers without requiring sample inputs. By performing symbolic execution with bounded loops and recursion, recording all input accesses, and generalizing traces into a context-aware grammar, STALAGMITE produces accurate, readable grammars that cover the entire input space. The method demonstrates high precision and recall (often 99–100%) across diverse parsers and uncovers subtle parser bugs through grammar evaluation. The work advances test input generation, reverse engineering, and documentation by providing a scalable, seed-free way to obtain formal input specifications from existing parsers, with open-source tooling and data.

Abstract

Generating effective test inputs for a software system requires that these inputs be valid, as they will otherwise be rejected without reaching actual functionality. In the absence of a specification for the input language, common test generation techniques rely on sample inputs, which are abstracted into matching grammars and/or evolved guided by test coverage. However, if sample inputs miss features of the input language, the chances of generating these features randomly are slim. In this work, we present the first technique for symbolically and automatically mining input grammars from the code of recursive-descent parsers. So far, the complexity of parsers has made such a symbolic analysis challenging to impossible. Our realization of the symbolic parsing technique overcomes these challenges by (1) associating each parser function parse_ELEM() with a nonterminal <ELEM>; (2) limiting recursive calls and loop iterations, such that a symbolic analysis of parse_ELEM() needs to consider only a finite number of paths; and (3) creating, for each path, an expansion alternative for <ELEM>. Being purely static, symbolic parsing does not require seed inputs; as it mitigates path explosion, it scales to complex parsers. Our evaluation indicates that symbolic parsing is highly accurate. Applied to parsers for complex languages such as TINY-C or JSON, our STALAGMITE implementation extracts grammars with an accuracy of 99–100%, substantially improving over the state of the art despite requiring only the program code and no input samples. The resulting grammars cover the entire input space, allowing for comprehensive and effective test generation, reverse engineering, and documentation.
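
The three steps in the abstract can be illustrated with a toy sketch. This is not STALAGMITE's implementation: instead of symbolically executing real parser code, it assumes each parser function has already been summarized in a hypothetical mini-DSL (`char`, `call`, `seq`, `choice`, `loop`), and the names `PARSERS`, `paths`, and `mine_grammar` are invented for illustration. Each function `parse_ELEM()` maps to a nonterminal `<ELEM>`, loops are unrolled up to a fixed bound so that the set of paths stays finite, and each path becomes one expansion alternative. Calls are kept as nonterminals rather than explored, so this sketch needs no separate recursion bound.

```python
from itertools import product

# Hypothetical mini-DSL describing parser bodies (an assumption of this sketch):
#   ("char", c)         consume the terminal character c
#   ("call", name)      call parser function parse_<name>()
#   ("seq", [a, b])     run the parts in order
#   ("choice", [a, b])  branch on the next input character
#   ("loop", body)      repeat body zero or more times
PARSERS = {
    "value": ("choice", [("call", "array"), ("char", "1")]),
    "array": ("seq", [
        ("char", "["),
        ("call", "value"),
        ("loop", ("seq", [("char", ","), ("call", "value")])),
        ("char", "]"),
    ]),
}


def paths(node, max_loop):
    """Return every bounded path through node as a tuple of grammar symbols."""
    kind = node[0]
    if kind == "char":
        return [(repr(node[1]),)]        # terminal symbol
    if kind == "call":
        return [(f"<{node[1]}>",)]       # step 1: function -> nonterminal
    if kind == "seq":
        part_paths = [paths(part, max_loop) for part in node[1]]
        return [sum(combo, ()) for combo in product(*part_paths)]
    if kind == "choice":
        return [p for alt in node[1] for p in paths(alt, max_loop)]
    if kind == "loop":
        body = paths(node[1], max_loop)
        result = []
        for n in range(max_loop + 1):    # step 2: bounded loop iterations
            result.extend(sum(combo, ()) for combo in product(body, repeat=n))
        return result
    raise ValueError(f"unknown node kind: {kind}")


def mine_grammar(parsers, max_loop=2):
    """Step 3: each finite path becomes one expansion alternative."""
    return {f"<{name}>": paths(body, max_loop) for name, body in parsers.items()}


grammar = mine_grammar(PARSERS)
for nonterminal, alternatives in grammar.items():
    for alternative in alternatives:
        print(nonterminal, "::=", " ".join(alternative))
```

With `max_loop=2`, `<array>` receives three alternatives (zero, one, and two comma-separated repetitions). The sketch stops at these bounded unrollings and does not attempt the generalization and refinement steps that the paper describes.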

Paper Structure

This paper contains 33 sections, 14 figures, 6 tables, 3 algorithms.

Figures (14)

  • Figure 1: HARRYDC-JSON grammar mined by STALAGMITE
  • Figure 2: How STALAGMITE works. STALAGMITE infers context-free input grammars from recursive-descent parsers. It starts by symbolically executing and tracing the program under test, comprehensively exploring execution paths by limiting loop iterations and recursive calls, while tracking input consumptions by the parser. Subsequently, these symbolic execution traces are converted into an input grammar by leveraging execution context information. Finally, this input grammar is refined to reduce overapproximation.
  • Figure 3: A simplified execution trace for HARRYDC-JSON
  • Figure 4: Parse tree derived from the sample execution trace
  • Figure 5: CJSON copies a part of the input buffer to the buffer number_c_string during parsing.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2