Cross-domain Chinese Sentence Pattern Parsing

Jingsi Yu, Cunliang Kong, Liner Yang, Meishan Zhang, Lin Zhu, Yujie Wang, Haozhe Lin, Maosong Sun, Erhong Yang

TL;DR

This work tackles cross-domain sentence pattern structure (SPS) parsing, which is hindered by existing parsers' reliance on textbook corpora. It proposes an LLM-enhanced self-training framework that injects partial syntactic rules from a source domain into target-domain prompts to generate domain-specific training data, mitigating domain shift. The approach combines rule-based filtering with LLM-generated data and uses the Berkeley Neural Parser as its backbone. Evaluated against a rule-based mapping baseline on STB (Textbook) to CTB (News) transfer, it achieves a $1.68$-point improvement in $F1$. The results demonstrate meaningful cross-domain adaptation in SPS parsing and offer a framework for data-efficient cross-domain syntactic analysis, with code and data to be released on GitHub.

Abstract

Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching. Existing SPS parsers rely heavily on textbook corpora for training, lacking cross-domain capability. To overcome this constraint, this paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework. Partial syntactic rules from a source domain are combined with target domain sentences to dynamically generate training data, enhancing the adaptability of the parser to diverse domains. Experiments conducted on textbook and news domains demonstrate the effectiveness of the proposed method, outperforming rule-based baselines by 1.68 points on F1 metrics.
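
The data-generation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the `Example` record, and the `generate`, `parse`, and `keep` callables are all hypothetical placeholders standing in for an LLM call, a parser such as the Berkeley Neural Parser, and a rule-based filter, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """A pseudo-labeled training instance (hypothetical record type)."""
    sentence: str
    tree: str  # bracketed parse, e.g. "(S (NP ...) (VP ...))"


def build_prompt(rule: str, target_sentences: List[str]) -> str:
    """Combine one source-domain syntactic rule with target-domain sentences.

    The wording is an assumption; the paper's actual prompt is shown in Figure 3.
    """
    examples = "\n".join(f"- {s}" for s in target_sentences)
    return (
        f"Syntactic rule: {rule}\n"
        f"Example sentences from the target domain:\n{examples}\n"
        "Write one new sentence in the same domain that follows the rule."
    )


def self_train(
    rules: List[str],
    target_sentences: List[str],
    generate: Callable[[str], str],    # stand-in for an LLM call
    parse: Callable[[str], str],       # stand-in for the current parser
    keep: Callable[[str, str], bool],  # stand-in for rule-based filtering
    rounds: int = 2,
) -> List[Example]:
    """Collect pseudo-labeled target-domain examples over several rounds."""
    pseudo: List[Example] = []
    for _ in range(rounds):
        for rule in rules:
            sentence = generate(build_prompt(rule, target_sentences))
            tree = parse(sentence)
            # Keep only parses that pass the filter for the injected rule.
            if keep(rule, tree):
                pseudo.append(Example(sentence, tree))
        # A real system would retrain the parser on source + pseudo data here.
    return pseudo
```

Passing the LLM, parser, and filter in as callables keeps the sketch runnable with simple stubs and leaves the choice of model and selection criteria open.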

Paper Structure (18 sections, 2 equations, 5 figures, 4 tables)

This paper contains 18 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The SPS diagram and corresponding tree for the sentence "各项指标增幅远远高于发展速度 (The growth rates of all indicators far exceed the development speed)". These two illustrations are essentially equivalent.
  • Figure 2: LLM-enhanced self-training framework for cross-domain SPS parsing.
  • Figure 3: Example LLM prompt for generating sentences based on syntactic rules and target-domain instances. Note that the blue markers and dotted lines are not components of the actual prompt but are included solely for illustrative purposes.
  • Figure 4: Example of removing redundant non-leaf node POS tags. Note that in the SPS grammar, punctuation marks are treated as suffixes to the preceding word.
  • Figure 5: Examples of SPS parsing by the baseline and our method. The left side shows the gold standard, the middle shows the baseline's parse, and the right side shows the parse produced by our method (LLM-enhanced self-training + CRS-based criteria).