Punctuation-aware treebank tree binarization
Eitan Klinger, Vivaan Wadhwa, Jungyeul Park
TL;DR
This work addresses the structural distortion introduced when punctuation is dropped during treebank binarization. It presents a punctuation-aware binarization pipeline that keeps punctuation marks as sibling nodes before binarization. The method is a deterministic, invertible transformation that uses intermediate @X markers to preserve boundary cues, yielding binary trees that remain faithful to the original structures and align better with derivational formalisms such as CCG. On the Penn Treebank, head–child identification accuracy rises from $73.66\%$ with Collins rules to $91.85\%$ with the punctuation-aware approach, and alignment with CCGbank is competitive, while the transformation remains fully reversible and supports cross-resource interoperability. The contributions include reproducible code, metadata artifacts, and a framework adaptable to multiple languages, improving the reliability, transparency, and extensibility of treebank preprocessing.
Abstract
This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head–child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head–child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85% with the same MLP classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.
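To make the idea of an invertible binarization with intermediate @X markers concrete, the following is a minimal Python sketch, not the released pipeline. It assumes a hypothetical tree encoding (a node is a `(label, children)` tuple and a word is a plain string), a purely right-branching fold rather than head-driven binarization, and marker labels formed as `"@" + label`. It also assumes that no original label begins with `@`, which is what makes the inverse unambiguous. Punctuation preterminals are left in place as ordinary siblings.

```python
# Illustrative sketch only: the node encoding, the right-branching fold,
# and the "@" + label marker naming are assumptions, not the paper's code.

def binarize(tree):
    """Binarize a tree, keeping punctuation children as ordinary siblings."""
    if isinstance(tree, str):          # a word (leaf)
        return tree
    label, children = tree
    kids = [binarize(child) for child in children]
    # Fold from the right: (X a b c d) -> (X a (@X b (@X c d)))
    while len(kids) > 2:
        kids[-2:] = [("@" + label, kids[-2:])]
    return (label, kids)


def unbinarize(tree):
    """Invert binarize() by splicing the children of @-marked nodes back in."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    flat = []
    for child in children:
        child = unbinarize(child)
        if not isinstance(child, str) and child[0].startswith("@"):
            flat.extend(child[1])      # marker node: restore its children
        else:
            flat.append(child)
    return (label, flat)


def max_arity(tree):
    """Largest number of children under any node (0 for a bare word)."""
    if isinstance(tree, str):
        return 0
    return max([len(tree[1])] + [max_arity(child) for child in tree[1]])


# (S (ADVP (RB However)) (, ,) (NP (PRP she)) (VP (VBD left)) (. .))
example = (
    "S",
    [
        ("ADVP", [("RB", ["However"])]),
        (",", [","]),
        ("NP", [("PRP", ["she"])]),
        ("VP", [("VBD", ["left"])]),
        (".", ["."]),
    ],
)

binary = binarize(example)
restored = unbinarize(binary)
```

In this sketch, `binary` has at most two children per node, with the comma and the period retained under `@S` nodes, and `restored` equals `example`, which illustrates the round-trip property on one tree. A head-driven variant would choose the fold direction per constituent instead of always folding from the right.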
