A Tree Sampler for Bounded Context-Free Languages
Breandan Considine
TL;DR
This paper addresses uniform sampling of parse trees from bounded context-free languages (BCFLs), which are defined by porous strings, i.e., strings containing holes. It introduces an algebraic, nested datatype framework using $\mathbb{T}_3$ and $\mathbb{T}_2$ to compactly represent candidate parse forests, and computes a fixed point $M_\infty$ that encodes the feasible derivations for a given template. Two sampling modes are developed: sampling with replacement, via a recursive Multinoulli sampler that is related to Boltzmann sampling and avoids rejection, and sampling without replacement, via a counting and pairing approach that maps trees to indices and lazily decodes trees from uniformly drawn integers. The method is claimed to be sound and complete for BCFL sampling, supports bounded generation and parallelization, and has practical applications in code completion and program repair. A Kotlin reference implementation of the $\mathbb{T}_2$ datatype is provided.
Abstract
We present a simple method for sampling trees, with or without replacement, from bounded context-free languages (BCFLs). A BCFL is a context-free language (CFL) corresponding to an incomplete string with holes, each of which can be filled by a valid terminal. To solve this problem, we introduce an algebraic datatype that compactly represents candidate parse forests for porous strings. Once the datatype is constructed, sampling trees reduces to sampling integers uniformly without replacement, then lazily decoding them into trees.
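
The final step, decoding uniformly drawn integers into trees, can be illustrated with a minimal Python sketch. It is not the paper's Kotlin implementation and does not use the $\mathbb{T}_2$ or $\mathbb{T}_3$ datatypes; it assumes a grammar in Chomsky normal form and substitutes a plain memoized counting table for the parse forest. The grammar for balanced parentheses, the hole marker `_`, and the names `make_sampler`, `sample_trees`, and `text` are all hypothetical choices made for illustration.

```python
import random
from functools import lru_cache

# Hypothetical CNF grammar for balanced parentheses (not taken from the paper).
BINARY = {"S": [("L", "R"), ("L", "X"), ("S", "S")], "X": [("S", "R")]}
UNARY = {"L": ["("], "R": [")"]}
HOLE = "_"  # assumed marker for a hole in the porous string


def make_sampler(template):
    """Return (count, decode) closures for the porous string `template`."""

    def leaves(nt, i):
        # Terminals that `nt` can emit at position i, respecting fixed tokens.
        return [t for t in UNARY.get(nt, []) if template[i] in (HOLE, t)]

    @lru_cache(maxsize=None)
    def count(nt, i, j):
        # Number of parse trees rooted at `nt` that span template[i:j].
        if j - i == 1:
            return len(leaves(nt, i))
        return sum(count(b, i, k) * count(c, k, j)
                   for b, c in BINARY.get(nt, [])
                   for k in range(i + 1, j))

    def decode(nt, i, j, n):
        # Map an index n in [0, count(nt, i, j)) to a unique tree.
        if j - i == 1:
            return (nt, leaves(nt, i)[n])
        for b, c in BINARY.get(nt, []):
            for k in range(i + 1, j):
                cb, cc = count(b, i, k), count(c, k, j)
                if n < cb * cc:
                    # Unpair n into one index per subtree.
                    return (nt, decode(b, i, k, n // cc), decode(c, k, j, n % cc))
                n -= cb * cc
        raise IndexError("index out of range")

    return count, decode


def sample_trees(template, k, start="S", rng=random):
    """Lazily yield up to k distinct trees, drawn uniformly without replacement."""
    count, decode = make_sampler(template)
    total = count(start, 0, len(template))
    for n in rng.sample(range(total), min(k, total)):
        yield decode(start, 0, len(template), n)


def text(tree):
    """Concatenate the leaves of a tree into the completed string."""
    return tree[1] if len(tree) == 2 else text(tree[1]) + text(tree[2])
```

For example, `[text(t) for t in sample_trees(["(", "_", "_", ")"], 2)]` enumerates the completions of a four-token template in random order. Because `decode` is a bijection between indices and trees, drawing distinct integers yields distinct trees without any rejection step; note that an ambiguous grammar such as this one can assign several trees to the same string.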
