Incremental Context-free Grammar Inference in Black Box Settings

Feifei Li; Xiao Chen; Xi Xiao; Xiaoyu Sun; Chuan Chen; Shaohua Wang; Jitao Han

TL;DR

Kedavra is a novel black-box grammar inference method that segments example strings into smaller units and infers the grammar incrementally. In empirical comparison, it achieves superior grammar quality, faster runtime, and improved readability.

Abstract

Black-box context-free grammar inference presents a significant challenge in many practical settings due to limited access to example programs. The state-of-the-art methods, Arvada and Treevada, employ heuristic approaches to generalize grammar rules, starting from flat parse trees and exploring diverse generalization sequences. We observe that the grammars these approaches infer suffer from low quality and readability, primarily because the approaches process entire example strings, which adds complexity and substantially slows computation. To overcome these limitations, we propose a novel method that segments example strings into smaller units and incrementally infers the grammar. In empirical comparison, our approach, named Kedavra, demonstrates superior grammar quality (enhanced precision and recall), faster runtime, and improved readability.

Paper Structure (25 sections, 5 figures, 5 tables, 5 algorithms)

Introduction
Background
Context-free grammar
Black-box Grammar Inference
SOTA - Arvada and Treevada
Motivating Example
Approach
Tokenization
Data Decomposition
Incremental Grammar Inference
Bubbling
Choose the Most Generalizable Bubble Set
Eliminating Over-generalization
Merging and Grammar Simplification
Generalize Rep and Expansion of Terminals
...and 10 more sections

Figures (5)

Figure 1: A simple example of Arvada workflow
Figure 2: Workflow of Kedavra
Figure 3: Results after pre-tokenization
Figure 4: Average F1 score over 10 runs of Arvada, Treevada, and Kedavra on each dataset (R0, R1, R2, R5). The horizontal bars in each sub-figure were added manually as a reference to better visualize how the F1 score of each inference algorithm fluctuates across datasets.
Figure 5: Precision of Arvada, Treevada, and Kedavra runs under different sampling methods (A = Arvada's sampling algorithm, T = Treevada's sampling algorithm, LPP10 = LimitPerProd10). The horizontal bars in each sub-figure were added manually as a reference to better visualize how the precision of each inference algorithm fluctuates across sampling algorithms.
