V-Star: Learning Visibly Pushdown Grammars from Program Inputs

Xiaodong Jia, Gang Tan

TL;DR

V-Star addresses the challenge of inferring program-input grammars by exploiting nesting structures through Visibly Pushdown Grammars (VPGs) and an $L^*$-style active-learning framework. It introduces novel tagging inference methods to identify call/return boundaries on characters or tokens, and extends learning to token-based VPLs with a tokenizer-aware converter. The approach yields theoretical guarantees for exact learning under practical assumptions and demonstrates superior accuracy on real-world grammars (e.g., S-Expressions, JSON, XML) compared with state-of-the-art CFG-learners, at the cost of increased query effort. Overall, V-Star offers a robust, nesting-focused pathway for precise input-description grammars, with potential extensions to efficiency, tokenizer inference, and broader VPG classes.

Abstract

Accurate description of program inputs remains a critical challenge in the field of programming languages. Active learning, as a well-established field, achieves exact learning for regular languages. We offer an innovative grammar inference tool, V-Star, based on the active learning of visibly pushdown automata. V-Star deduces nesting structures of program input languages from sample inputs, employing a novel inference mechanism based on nested patterns. This mechanism identifies token boundaries and converts languages such as XML documents into VPLs. We then adapted Angluin's L-Star, an exact learning algorithm, for VPA learning, which improves the precision of our tool. Our evaluation demonstrates that V-Star effectively and efficiently learns a variety of practical grammars, including S-Expressions, JSON, and XML, and outperforms other state-of-the-art tools.
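To make the notion of a visibly pushdown automaton concrete, the following is a minimal sketch of a VPA acceptor, not V-Star's implementation. In a VPA the input alphabet is split into call, return, and plain symbols, and the symbol alone decides whether the automaton pushes, pops, or leaves the stack untouched. The class name `VPA`, the toy bracket language, and the choice to accept only with an empty stack (that is, only well-matched strings) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class VPA:
    """Toy visibly pushdown automaton (illustrative, hypothetical names)."""
    calls: Set[str]                                # symbols that always push
    returns: Set[str]                              # symbols that always pop
    start: str
    accepting: Set[str]
    call_delta: Dict[Tuple[str, str], Tuple[str, str]]  # (state, call) -> (state, pushed symbol)
    ret_delta: Dict[Tuple[str, str, str], str]          # (state, stack top, return) -> state
    plain_delta: Dict[Tuple[str, str], str]             # (state, plain) -> state

    def accepts(self, word: str) -> bool:
        state, stack = self.start, []
        for ch in word:
            if ch in self.calls:
                step = self.call_delta.get((state, ch))
                if step is None:
                    return False
                state, pushed = step
                stack.append(pushed)
            elif ch in self.returns:
                if not stack:
                    return False  # unmatched return symbol
                nxt = self.ret_delta.get((state, stack.pop(), ch))
                if nxt is None:
                    return False
                state = nxt
            else:
                nxt = self.plain_delta.get((state, ch))
                if nxt is None:
                    return False
                state = nxt
        # Assumption: only well-matched strings are accepted (empty stack).
        return state in self.accepting and not stack


# Toy language: atoms 'a' nested inside matching () and [] pairs.
toy = VPA(
    calls={"(", "["},
    returns={")", "]"},
    start="q0",
    accepting={"q0"},
    call_delta={("q0", "("): ("q0", "P"), ("q0", "["): ("q0", "B")},
    ret_delta={("q0", "P", ")"): "q0", ("q0", "B", "]"): "q0"},
    plain_delta={("q0", "a"): "q0"},
)

print(toy.accepts("(a[a])"))  # well-matched
print(toy.accepts("(a]"))     # mismatched bracket kinds
```

Because the stack action is fixed by the symbol class, a learner that knows which symbols are calls and returns can treat the rest of the problem much like finite-automaton learning, which is what makes an $L^*$-style approach applicable once the tagging is inferred.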

Paper Structure

This paper contains 28 sections, 24 theorems, 41 equations, 2 figures, 1 table, 5 algorithms.

Key Result

Proposition 4.1

If $\mathcal{Q}= \{(Q_i,C_i)\,\mid\,i \in [0..k]\}$ is separable and language $\hat{\mathcal{L}}=\{ t(s) \mid s\in\mathcal{L} \}$ is a VPL, then the number of states in $\mathbf{constructVPA}(\mathcal{Q})$ is bounded above by the number of states in the minimal $k$-SEVPA for VPL $\hat{\mathcal{L}}$.
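The set $\hat{\mathcal{L}}=\{ t(s) \mid s\in\mathcal{L} \}$ is the image of the input language under a tagging function $t$. As a rough sketch only, the snippet below assumes that $t$ simply marks each occurrence of a known call character and a known return character; the marker characters, the function name `tag`, and the sample strings are hypothetical and are not taken from the paper.

```python
def tag(s, calls, returns, call_mark="\u2039", ret_mark="\u203a"):
    """Mark call and return characters in s (assumed form of a tagging function t)."""
    out = []
    for ch in s:
        if ch in calls:
            out.append(call_mark + ch)   # mark placed before a call character
        elif ch in returns:
            out.append(ch + ret_mark)    # mark placed after a return character
        else:
            out.append(ch)               # plain characters are left unchanged
    return "".join(out)


# A few hypothetical members of L, and their images under the assumed t.
language_sample = ["()", "(a)", "(a(a))"]
tagged_sample = {tag(s, calls={"("}, returns={")"}) for s in language_sample}
print(sorted(tagged_sample))
```

Under this reading, the proposition's hypothesis that $\hat{\mathcal{L}}$ is a VPL says that once the call and return positions are made explicit, the tagged strings can be recognized by a visibly pushdown automaton.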

Figures (2)

  • Figure 1: An oracle VPG and a set of seed strings.
  • Figure 2: An example XML grammar and the associated lexical rules.

Theorems & Definitions (49)

  • Definition 3.1: Well-matched VPGs
  • Definition 4.1
  • Definition 4.2: Nested Words and $\Sigma_M$
  • Definition 4.3: $\mathbf{constructVPA}(\mathcal{Q})$ function
  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof
  • Proposition 4.3
  • proof
  • ...and 39 more