V-Star: Learning Visibly Pushdown Grammars from Program Inputs

Xiaodong Jia; Gang Tan

TL;DR

V-Star addresses the challenge of inferring program-input grammars by exploiting nesting structures through Visibly Pushdown Grammars (VPGs) and an $L^*$-style active-learning framework. It introduces novel tagging inference methods to identify call/return boundaries on characters or tokens, and extends learning to token-based VPLs with a tokenizer-aware converter. The approach yields theoretical guarantees for exact learning under practical assumptions and demonstrates superior accuracy on real-world grammars (e.g., S-Expressions, JSON, XML) compared with state-of-the-art CFG-learners, at the cost of increased query effort. Overall, V-Star offers a robust, nesting-focused pathway for precise input-description grammars, with potential extensions to efficiency, tokenizer inference, and broader VPG classes.

Abstract

Accurately describing program inputs remains a critical challenge in the field of programming languages. Active learning is a well-established field that achieves exact learning for regular languages. We present a grammar inference tool, V-Star, based on the active learning of visibly pushdown automata (VPAs). V-Star deduces the nesting structures of program input languages from sample inputs, using a novel inference mechanism based on nested patterns. This mechanism identifies token boundaries and converts languages such as XML documents into visibly pushdown languages (VPLs). We then adapt Angluin's L-Star, an exact learning algorithm, to VPA learning, which improves the precision of our tool. Our evaluation demonstrates that V-Star effectively and efficiently learns a variety of practical grammars, including S-Expressions, JSON, and XML, and outperforms other state-of-the-art tools.
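To make the notion of nesting structure concrete, here is a minimal sketch of the property that makes a language "visibly pushdown": the stack action is determined by the input symbol alone. Call symbols push, return symbols pop, and all other symbols leave the stack untouched. The function name `is_well_matched` and the `pairs` mapping are hypothetical and chosen for illustration. The sketch assumes the call/return pairing is already known, whereas V-Star infers the tagging from sample inputs.

```python
from typing import Dict, List


def is_well_matched(word: str, pairs: Dict[str, str]) -> bool:
    """Check that every call symbol is closed by its paired return symbol.

    `pairs` maps each call symbol to its return symbol (an assumed,
    hand-written tagging; it is not inferred here).
    """
    returns = set(pairs.values())
    stack: List[str] = []
    for ch in word:
        if ch in pairs:
            # Call symbol: push the return symbol we expect to see later.
            stack.append(pairs[ch])
        elif ch in returns:
            # Return symbol: it must match the most recent open call.
            if not stack or stack.pop() != ch:
                return False
        # Plain symbols do not touch the stack.
    # Well-matched words leave no pending calls.
    return not stack
```

For example, with `pairs = {"(": ")"}` the S-Expression-like input `"(a (b c))"` is accepted and `"(a"` is rejected. A full VPA additionally tracks a finite control state, which this sketch omits.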

Paper Structure (28 sections, 24 theorems, 41 equations, 2 figures, 1 table, 5 algorithms)

Introduction
Related Work
Background
Grammar Inference
Visibly Pushdown Grammars
Visibly Pushdown Automata
Angluin's L-Star Algorithm
V-Star for a Character-Based VPL
Problem Statement
The Unique Pairing assumption for oracle languages
Learning VPA with Known Tagging
Background: $k$-SEVPA and Congruence Relations
Access Words and Test Words
Tagging Inference
V-Star for a Token-Based VPL
...and 13 more sections

Key Result

Proposition 4.1

If $\mathcal{Q}= \{(Q_i,C_i)\,\mid\,i \in [0..k]\}$ is separable and language $\hat{\mathcal{L}}=\{ t(s) \mid s\in\mathcal{L} \}$ is a VPL, then the number of states in $\mathbf{constructVPA}(\mathcal{Q})$ is bounded above by the number of states in the minimal $k$-SEVPA for VPL $\hat{\mathcal{L}}$.
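As a reading aid for the notation $\hat{\mathcal{L}}=\{ t(s) \mid s\in\mathcal{L} \}$, the sketch below shows one hypothetical rendering of a tagging function $t$ that labels each character of a string as a call, a return, or a plain symbol. The names `tag`, `calls`, and `returns`, and the tuple encoding of tagged symbols, are assumptions made for illustration. The tagging in the paper is inferred by V-Star and is not supplied by hand as it is here.

```python
from typing import Iterable, Set, Tuple

TaggedWord = Tuple[Tuple[str, str], ...]


def tag(s: str, calls: Set[str], returns: Set[str]) -> TaggedWord:
    """Label each character of `s` as 'call', 'return', or 'plain'."""
    tagged = []
    for ch in s:
        if ch in calls:
            tagged.append(("call", ch))
        elif ch in returns:
            tagged.append(("return", ch))
        else:
            tagged.append(("plain", ch))
    return tuple(tagged)


def tagged_language(samples: Iterable[str], calls: Set[str],
                    returns: Set[str]) -> Set[TaggedWord]:
    """Apply the tagging to a finite sample of the language: { t(s) | s in samples }."""
    return {tag(s, calls, returns) for s in samples}
```

Applied to a finite sample such as `{"(a)", "()"}` with `calls={"("}` and `returns={")"}`, `tagged_language` yields the corresponding finite subset of the tagged language.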

Figures (2)

Figure 1: An oracle VPG and a set of seed strings.
Figure 2: An example XML grammar and the associated lexical rules.

Theorems & Definitions (49)

Definition 3.1: Well-matched VPGs
Definition 4.1
Definition 4.2: Nested Words and $\Sigma_M$
Definition 4.3: $\mathbf{constructVPA}(\mathcal{Q})$ function
Proposition 4.1
proof
Proposition 4.2
proof
Proposition 4.3
proof
...and 39 more
