V-Star: Learning Visibly Pushdown Grammars from Program Inputs
Xiaodong Jia, Gang Tan
TL;DR
V-Star addresses the challenge of inferring program-input grammars by exploiting nesting structures through Visibly Pushdown Grammars (VPGs) and an $L^*$-style active-learning framework. It introduces novel tagging inference methods that identify call/return boundaries on characters or tokens, and it extends learning to token-based VPLs with a tokenizer-aware converter. The approach yields theoretical guarantees for exact learning under practical assumptions and demonstrates superior accuracy on real-world grammars (e.g., S-Expressions, JSON, XML) compared with state-of-the-art CFG learners, at the cost of increased query effort. Overall, V-Star offers a robust, nesting-focused pathway to precise input-description grammars, with room for future work on efficiency, tokenizer inference, and broader VPG classes.
Abstract
Accurately describing program inputs remains a critical challenge in the field of programming languages. Active learning is a well-established approach that achieves exact learning for regular languages. We present V-Star, a grammar inference tool based on the active learning of visibly pushdown automata (VPAs). V-Star deduces the nesting structures of program input languages from sample inputs, using a novel inference mechanism based on nested patterns. This mechanism identifies token boundaries and converts languages such as XML into visibly pushdown languages (VPLs). We then adapt Angluin's L-Star, an exact learning algorithm, to VPA learning, which improves the precision of our tool. Our evaluation demonstrates that V-Star effectively and efficiently learns a variety of practical grammars, including S-Expressions, JSON, and XML, and outperforms other state-of-the-art tools.
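
To make the notion of nesting structure concrete, the following is a minimal Python sketch and not V-Star's implementation. It assumes a hand-written tagging (the hypothetical `JSON_LIKE_TAGGING`, which maps each call symbol to its return symbol), whereas V-Star infers such a tagging from sample inputs. The function name `is_well_matched` is likewise illustrative, and the check ignores details such as brackets inside string literals.

```python
# Hypothetical tagging for a JSON-like language: call symbol -> return symbol.
# This mapping is an assumption for illustration; V-Star infers it from samples.
JSON_LIKE_TAGGING = {"{": "}", "[": "]"}


def is_well_matched(text, pairs):
    """Return True if every call symbol in `text` is closed by its paired
    return symbol in properly nested order. All other characters are
    treated as plain symbols that leave the stack untouched."""
    returns = set(pairs.values())
    stack = []
    for ch in text:
        if ch in pairs:
            # Call symbol: remember which return symbol must close it.
            stack.append(pairs[ch])
        elif ch in returns:
            # Return symbol: it must match the most recent open call.
            if not stack or stack.pop() != ch:
                return False
    # Well-matched only if no call symbol is left open.
    return not stack
```

The stack is driven only by the input symbol itself: call symbols push, return symbols pop, and plain symbols do neither. This input-driven stack discipline is what distinguishes visibly pushdown languages from general context-free languages.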
