Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure
Yang Hou, Zhenghua Li
TL;DR
This work tackles Chinese dependency parsing without explicit word boundaries by modeling latent intra-word structures, so that each word-level dependency tree is interpreted as a forest of character-level trees. It introduces two compatibility constraints and a Constrained Eisner algorithm that ensure character-level trees reflect the word-level structure, together with a coarse-to-fine strategy that guides arc role assignment toward more plausible structures. The resulting latent models outperform pipeline and prior joint models on CTB datasets, and the analysis shows that coarse-to-fine decoding reduces leftward intra-word bias, producing more linguistically plausible structures. The authors state that code will be released.
Abstract
Revealing the syntactic structure of sentences in Chinese poses significant challenges for word-level parsers due to the absence of clear word boundaries. To facilitate a transition from word-level to character-level Chinese dependency parsing, this paper proposes modeling latent internal structures within words. In this way, each word-level dependency tree is interpreted as a forest of character-level trees. A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees, guaranteeing a single root for intra-word structures and establishing inter-word dependencies between these roots. Experiments on Chinese treebanks demonstrate the superiority of our method over both the pipeline framework and previous joint models. A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
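The compatibility requirement described in the abstract can be illustrated with a small checker. This is a minimal sketch, not the paper's Constrained Eisner algorithm: it only verifies, after the fact, that a given character-level tree has exactly one root character per word and that every inter-word arc connects two word roots. The function name `check_compatibility`, the head-array encoding (with -1 marking the sentence root), and the half-open character spans are all assumptions made for illustration, and the input is assumed to already be a well-formed tree.

```python
from typing import List, Optional, Tuple


def check_compatibility(
    char_heads: List[int], word_spans: List[Tuple[int, int]]
) -> Optional[List[int]]:
    """Check a character-level tree against a word segmentation.

    char_heads[i] is the head character of character i (-1 = sentence root).
    word_spans holds half-open (start, end) character spans, one per word.
    Returns the induced word-level heads (-1 = root) if the tree is
    compatible, otherwise None. Hypothetical helper, for illustration only.
    """
    # Map each character index to the index of the word containing it.
    char_to_word = {}
    for w, (start, end) in enumerate(word_spans):
        for c in range(start, end):
            char_to_word[c] = w

    # Constraint 1 (as read from the abstract): each word has exactly one
    # character whose head lies outside the word, i.e. a single root.
    word_roots = []
    for w, (start, end) in enumerate(word_spans):
        roots = [
            c for c in range(start, end)
            if char_heads[c] == -1 or char_to_word[char_heads[c]] != w
        ]
        if len(roots) != 1:
            return None
        word_roots.append(roots[0])

    # Constraint 2 (as read from the abstract): an inter-word arc must
    # attach a word's root character to the root character of its head word.
    word_heads = []
    for root in word_roots:
        head_char = char_heads[root]
        if head_char == -1:
            word_heads.append(-1)
            continue
        head_word = char_to_word[head_char]
        if word_roots[head_word] != head_char:
            return None
        word_heads.append(head_word)
    return word_heads


# Toy sentence 我 / 喜欢 / 苹果 (characters 0..4); the intra-word arcs are
# made up for the example and are not taken from the paper.
spans = [(0, 1), (1, 3), (3, 5)]
compatible = check_compatibility([1, -1, 1, 4, 1], spans)
bad_attachment = check_compatibility([1, -1, 1, 4, 2], spans)
two_roots = check_compatibility([1, -1, 1, 1, 1], spans)
```

In the first call, each word has a single root character and the recovered word-level heads are `[1, -1, 1]`. The second call fails because 果 attaches to 欢, which is not the root of its word. The third fails because both characters of 苹果 take heads outside the word.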
