Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure

Yang Hou, Zhenghua Li

TL;DR

This work tackles Chinese dependency parsing without explicit word boundaries by modeling latent intra-word structures, so that each word-level dependency tree is interpreted as a forest of character-level trees. It introduces two compatibility constraints and a constrained Eisner algorithm to keep character-level trees consistent with the word-level structure, while a coarse-to-fine strategy guides arc role assignment to improve plausibility. The resulting latent models outperform pipeline and prior joint models on CTB datasets, and the analysis shows that coarse-to-fine decoding reduces the bias toward leftward intra-word structures, producing more linguistically plausible analyses. The method allows flexible intra-word representations under constrained decoding, with code to be released for reproducibility.

Abstract

Revealing the syntactic structure of sentences in Chinese poses significant challenges for word-level parsers due to the absence of clear word boundaries. To facilitate a transition from word-level to character-level Chinese dependency parsing, this paper proposes modeling latent internal structures within words. In this way, each word-level dependency tree is interpreted as a forest of character-level trees. A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees, guaranteeing a single root for intra-word structures and establishing inter-word dependencies between these roots. Experiments on Chinese treebanks demonstrate the superiority of our method over both the pipeline framework and previous joint models. A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
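
The compatibility requirement described in the abstract can be illustrated with a small checker. This is a minimal sketch, not the authors' implementation: it assumes a segmented sentence and a character-level head array, and it encodes one reading of the two conditions the abstract names (a single root per word, and inter-word arcs that connect only word roots). The function name `is_compatible`, the `-1` convention for the sentence root, and the toy head assignments are all hypothetical. The sketch also assumes the input is already a well-formed tree and does not check for cycles.

```python
from typing import Dict, List


def is_compatible(words: List[str], char_heads: List[int]) -> bool:
    """Check a character-level tree against a word segmentation.

    char_heads[i] is the 0-based index of the head of character i,
    or -1 if character i is the sentence root (hypothetical convention).
    """
    # Map every character position to the index of the word containing it.
    char_to_word: List[int] = []
    for w_idx, word in enumerate(words):
        char_to_word.extend([w_idx] * len(word))
    if len(char_heads) != len(char_to_word):
        raise ValueError("char_heads must have one entry per character")

    # A character is a word root if its head lies outside its own word.
    roots: Dict[int, List[int]] = {}
    for c, h in enumerate(char_heads):
        if h == -1 or char_to_word[h] != char_to_word[c]:
            roots.setdefault(char_to_word[c], []).append(c)

    # Condition 1 (assumed reading): exactly one root per word.
    for w_idx in range(len(words)):
        if len(roots.get(w_idx, [])) != 1:
            return False

    # Condition 2 (assumed reading): an inter-word arc must attach
    # to the root character of the head word.
    for c, h in enumerate(char_heads):
        if h != -1 and char_to_word[h] != char_to_word[c]:
            if roots[char_to_word[h]][0] != h:
                return False
    return True


# Toy example with two two-character words; heads are illustrative only.
words = ["中国", "发展"]
ok = is_compatible(words, [1, 3, 3, -1])        # one root per word, roots linked
two_roots = is_compatible(words, [3, 3, 3, -1])  # first word has two roots
non_root = is_compatible(words, [1, 2, 3, -1])   # arc attaches to a non-root
```

In the toy example, only the first head assignment satisfies both conditions; the second gives the first word two roots, and the third attaches an inter-word arc to a character that is not the root of its word.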

Paper Structure (50 sections, 14 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: A word-level dependency tree and corresponding character-level trees with three types of intra-word structure. Intra-word dependencies are represented by dashed arcs and their labels are omitted.
  • Figure 2: Examples containing illegal arcs. Incorrect characters and arcs are highlighted in red. Triangles represent complete spans, while trapezoids represent incomplete spans. Dashed and solid lines indicate intra-word and inter-word structures, respectively.
  • Figure 3: Deduction rules for coarse-to-fine parsing. Dashed and solid lines indicate intra-word spans (WI) and inter-word spans (WE), respectively. The highlighted rule can be ignored to satisfy the root-as-head constraint. Only R-rules are presented; the symmetric L-rules and initial conditions are omitted for brevity.
  • Figure 4: The unlabeled attachment score (UAS) for words of different lengths on CTB7 test set using SD.