Fast Deterministic Black-box Context-free Grammar Inference

Mohammad Rifat Arefin; Suraj Shetiya; Zili Wang; Christoph Csallner

TL;DR

The paper tackles black-box CFG inference under limited program samples, where prior work like Arvada suffers from nondeterministic exploration and high $O(n^4)$ runtime. It introduces TreeVada, a deterministic approach that pre-structures input programs along bracket nesting, applies learned rules recursively, and uses depth- and length-aware bubble ranking to guide generalization. Empirical evaluation across multiple seed sets shows TreeVada achieves faster runtimes and higher F1 scores than Arvada in most cases, with grammars that are often smaller and more parse-efficient; the method is open-source. This work enhances reproducibility and applicability of black-box CFG inference for languages with closed parsers, aiding tasks such as code comprehension, reverse engineering, and robust test-input generation.

Abstract

Black-box context-free grammar inference is a hard problem because, in many practical settings, only a limited number of example programs is available. The state-of-the-art approach Arvada heuristically generalizes grammar rules starting from flat parse trees and is non-deterministic so that it can explore different generalization sequences. We observe that many of Arvada's generalization steps violate common language concept nesting rules. We thus propose to pre-structure input programs along these nesting rules, apply learnt rules recursively, and make black-box context-free grammar inference deterministic. The resulting TreeVada yielded faster runtime and higher-quality grammars in an empirical comparison. The TreeVada source code, scripts, evaluation parameters, and training data are open-source and publicly available (https://doi.org/10.6084/m9.figshare.23907738).
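To make the idea of pre-structuring along nesting rules concrete, the following minimal sketch groups an already tokenized program into a nested tree along matching brackets. It is an illustration only, not TreeVada's implementation: the function name `prestructure`, the `BRACKETS` table, and the example tokens are assumptions introduced here.

```python
from typing import List, Union

Tree = List[Union[str, "Tree"]]

# Assumed bracket pairs; the real set depends on the target language.
BRACKETS = {"(": ")", "[": "]", "{": "}"}


def prestructure(tokens: List[str]) -> Tree:
    """Group a flat token list into a nested tree along matching brackets."""
    root: Tree = []
    stack: List[Tree] = [root]   # currently open subtrees, innermost last
    expected: List[str] = []     # closing brackets still awaited
    for tok in tokens:
        if tok in BRACKETS:
            # An opening bracket starts a new subtree under the current one.
            subtree: Tree = [tok]
            stack[-1].append(subtree)
            stack.append(subtree)
            expected.append(BRACKETS[tok])
        elif expected and tok == expected[-1]:
            # The matching closing bracket ends the innermost subtree.
            stack.pop().append(tok)
            expected.pop()
        elif tok in BRACKETS.values():
            raise ValueError(f"unbalanced bracket: {tok!r}")
        else:
            stack[-1].append(tok)
    if expected:
        raise ValueError("unclosed bracket")
    return root


# Hypothetical token sequence, chosen only to show two nesting levels.
tokens = ["if", "(", "n", "==", "(", "n", "+", "n", ")", ")", "then", "skip"]
tree = prestructure(tokens)
```

Each bracketed sequence ends up in its own subtree, so a later generalization step working on such a tree could not group tokens across a bracket boundary. A flat parse tree, by contrast, offers no such guard.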

Paper Structure (30 sections, 8 equations, 6 figures, 8 tables)

Introduction
Background
Black-box Grammar Inference
State-of-the-art Inference: Arvada in $O(n^4)$
Arvada Run = 10 Non-deterministic $O(n^4)$ Runs
Not Generalizing Recursively
Breaking Bracket-implied Nesting Structure
Overview and Design
Assumptions on Strings & Brackets
Pre-tokenizing Input Programs
Program Structure in String Literals
Pre-structuring Parse Trees Along Brackets
Removing Specialized Bubbling Heuristics
Deterministic Grammar Inference
Depth- and Length-aware Bubble Ranking
...and 15 more sections

Figures (6)

Figure 1: while's golden grammar $\mathcal{G}_w$ (Arvada's motivating example [arvada21ASE], reformatted, plus missing skip rule).
Figure 2: Top to bottom: Input while programs $S_1$ and a resulting Arvada run: initial (pre-tokenized) flat parse trees, initial node-pair merges (green), 1st bubble merge (lime), 2nd bubble merge (yellow) without reapplying rule, and 3rd bubble merge (orange) breaking tree nesting; resulting grammar.
Figure 3: Top to bottom: TreeVada's pre-structured bracket-implied trees for Figure 2's $S_1$ input programs with bracketed sequences (gray) bubbled, initial node-pair merges (yellow & lime), and 1st bubble via bubble-ranking (green); the inferred grammar captures the Figure 1 golden grammar rules used by $S_1$.
Figure 4: Average (and standard deviation) of time spent on ranking bubbles, sampling strings, and in black-box parser. Each value is normalized by dividing by Arvada's average total runtime for that language (R1, R5 seed); A = Arvada; T = TreeVada.
Figure 5: F1 score of 10 Arvada (-) and TreeVada ($\blacktriangleleft$) runs on hand-picked (H [arvada21ASE]) and random seeds (R0 [arvada21ASE], R1, R2, R5).
...and 1 more figures
