Construction and educational application of a linguistically grounded dependency treebank for Uyghur

Jiaxin Zuo, Yiquan Wang, Yuan Pan, Xiadiya Yibulayin

TL;DR

Uyghur’s agglutinative morphology and frequent zero copula present challenges for education-focused NLP under Universal Dependencies. The authors propose MUDT, a linguistically grounded dependency framework with a four-layer morphological decomposition and targeted dependency relations (zero copula, postpositional head, and compound predicates) built via a hybrid AI–human pipeline on 3,456 sentences. Intrinsic and extrinsic evaluations show substantial reductions in non-projectivity and improved parsing performance, while a prototype AI-assisted grammar tutor demonstrates significant learning gains (mean gain $13.73$ vs $7.88$, $p=0.018$, $d=0.90$). The work shows that preserving fine-grained morphosyntactic information yields pedagogically actionable feedback and stronger educational outcomes for low-resource languages, with data and code available for replication.

Abstract

Developing effective educational technologies for low-resource agglutinative languages like Uyghur is often hindered by the mismatch between existing annotation frameworks and specific grammatical structures. To address this challenge, this study introduces the Modern Uyghur Dependency Treebank (MUDT), a linguistically grounded annotation framework specifically designed to capture the agglutinative complexity of Uyghur, including zero copula constructions and fine-grained case marking. Utilizing a hybrid pipeline that combines Large Language Model pre-annotation with rigorous human correction, a high-quality treebank consisting of 3,456 sentences was constructed. Intrinsic structural evaluation reveals that MUDT significantly improves dependency projectivity by reducing the crossing-arc rate from 7.35% in the Universal Dependencies standard to 0.06%. Extrinsic parsing experiments using UDPipe and Stanza further demonstrate that models trained on MUDT achieve superior in-domain accuracy and cross-domain generalization compared to UD-based baselines. To validate the practical utility of this computational resource, an AI-assisted grammar tutoring system was developed to translate MUDT-based syntactic analyses into interpretable pedagogical feedback. A controlled experiment involving 35 second-language learners indicated that students receiving syntax-aware feedback achieved significantly higher learning gains compared to those in a control group. These findings establish MUDT as a robust foundation for syntactic analysis and underscore the critical role of linguistically informed natural language processing resources in bridging the gap between computational models and the cognitive needs of second-language learners.
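
The crossing-arc rate mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state exactly how the rate is computed, so the code below assumes one common reading: the fraction of dependency arcs that cross at least one other arc, with the artificial root attachment excluded. The function names and the head-list input format (one 1-based head index per token, `0` for the root) are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Set


def crossing_dependents(heads: List[int]) -> Set[int]:
    """Return the 1-based indices of dependents whose arc crosses another arc.

    heads[i] is the 1-based head of token i+1; 0 marks the root.
    The root attachment is not counted as an arc (an assumption).
    """
    # Store each arc as (left endpoint, right endpoint, dependent index).
    arcs = [(min(h, d), max(h, d), d)
            for d, h in enumerate(heads, start=1) if h != 0]
    crossing: Set[int] = set()
    for i, (a, b, dep1) in enumerate(arcs):
        for c, d, dep2 in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                crossing.add(dep1)
                crossing.add(dep2)
    return crossing


def crossing_arc_rate(treebank: List[List[int]]) -> float:
    """Fraction of non-root arcs in the treebank that cross some other arc."""
    total = sum(sum(1 for h in heads if h != 0) for heads in treebank)
    crossed = sum(len(crossing_dependents(heads)) for heads in treebank)
    return crossed / total if total else 0.0


# Toy head lists (not taken from MUDT or UD data).
projective = [2, 0, 2]         # arcs 2->1 and 2->3, no crossing
non_projective = [3, 4, 0, 3]  # arcs 3->1 and 4->2 cross
rate = crossing_arc_rate([projective, non_projective])
```

Under this definition a fully projective treebank has a rate of 0. If the rate is instead defined per sentence (the share of sentences containing at least one crossing), only the aggregation in `crossing_arc_rate` would change.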

Paper Structure

This paper contains 25 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Schematic representation of the full-stack research framework. The framework integrates linguistic theory, resource construction, model training, and educational application through four logical layers. Layer 1 establishes the linguistic foundation by adapting the Universal Dependencies framework to accommodate Uyghur agglutinative features through the MUDT design. Layer 2 details the treebank construction process using a hybrid human-AI loop to produce the gold standard corpus. Layer 3 illustrates the computational modeling phase where various parsers are evaluated to validate the structural fidelity of MUDT. Layer 4 demonstrates the educational application workflow where the system generates interpretable feedback to enhance learner outcomes.
  • Figure 2: Structural comparison of three critical linguistic phenomena. The top row (a-c) shows typical UD structures (often semantically opaque). The bottom row (d-f) shows the proposed MUDT structures designed for pedagogical clarity. Columns correspond to Zero Copula, Postpositional Phrases, and Compound Predicates respectively.
  • Figure 3: Structural Comparison. (a) UDT-style analysis where the postposition incorrectly depends on the noun. (b) MUDT analysis correctly modeling the postposition as the head.
  • Figure 4: Operational flowchart of the syntactic diagnosis and explainable feedback mechanism in MUDT-Tutor. The process is illustrated using a postposition correction task. (a) The Learner Input Interface presents the front-end interaction where students submit corrections for ungrammatical sentences. (b) The Syntax-Aware Diagnosis module performs parallel parsing of the input pair to extract dependency relations and verify structural validity against the rule engine. (c) The Intelligent Scaffolding Feedback stage generates a dynamic response panel displaying the correctness verdict, standard solution, error classification, and specific pedagogical guidance based on the underlying syntactic analysis.