A Truly Joint Neural Architecture for Segmentation and Parsing

Danit Yshaayahu Levi, Reut Tsarfaty

TL;DR

A joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological segmentation and syntactic parsing tasks at once.

Abstract

Contemporary multilingual dependency parsers can parse a diverse set of languages, but for Morphologically Rich Languages (MRLs), performance is attested to be lower than for other languages. The key challenge is that, due to the high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance. Pre-neural dependency parsers for MRLs subscribed to the joint morpho-syntactic hypothesis, which states that morphological segmentation and syntactic parsing should be solved jointly, rather than as a pipeline where segmentation precedes parsing. However, neural state-of-the-art parsers to date use a strict pipeline. In this paper we introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological segmentation and syntactic parsing tasks at once. Our experiments on Hebrew, a rich and highly ambiguous MRL, demonstrate state-of-the-art performance on parsing, tagging and segmentation of the Hebrew section of UD, using a single model. The proposed architecture is LLM-based and language agnostic, providing a solid foundation for MRLs to obtain further performance improvements and bridge the gap with other languages.
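
To make the lattice idea concrete, here is a minimal Python sketch of a morphological lattice and the segmentations it encodes. It is not the authors' implementation: the `Edge` class, the `enumerate_paths` helper, and the segment boundaries in `TOY_LATTICE` are simplified assumptions for illustration (the transliterated phrase bclm hneim is borrowed from Figure 1, but the real lattice may contain more analyses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: int  # lattice node where the segment begins
    end: int    # lattice node where the segment ends
    form: str   # transliterated segment (hypothetical analysis)


def enumerate_paths(edges, source, sink):
    """Return every segment sequence that leads from source to sink."""
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge.start, []).append(edge)

    paths = []

    def walk(node, prefix):
        if node == sink:
            paths.append(prefix)
            return
        for edge in outgoing.get(node, []):
            walk(edge.end, prefix + [edge.form])

    walk(source, [])
    return paths


# Hypothetical, simplified lattice for the two tokens "bclm hneim".
# Nodes 0..3 span the first token, nodes 3..5 span the second.
TOY_LATTICE = [
    Edge(0, 1, "b"), Edge(1, 2, "cl"), Edge(2, 3, "m"),  # b + cl + m
    Edge(0, 2, "bcl"),                                   # bcl + m (reuses the "m" edge)
    Edge(3, 4, "h"), Edge(4, 5, "neim"),                 # h + neim
    Edge(3, 5, "hneim"),                                 # unsegmented reading
]

PATHS = enumerate_paths(TOY_LATTICE, source=0, sink=5)
for path in PATHS:
    print(" ".join(path))
```

A pipeline system would commit to one of these paths before parsing. In the joint setting described above, all lattice edges are kept as candidate tree nodes, so choosing a segmentation and choosing a tree become a single decision.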

Paper Structure (28 sections, 9 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: The morphological lattice for the Hebrew phrase bclm hneim and two associated dependency trees depicting alternative segmentations (source: YAP). The upper tree illustrates the syntactic structure corresponding to "In their pleasant shadow", while the lower tree corresponds to "Their onion was pleasant". This highlights the existence of multiple morphological decompositions and various potential dependency trees.
  • Figure 2: The Head matrix of the Hebrew sentence bkrti bbit hlbn. Each row depicts the scores assigned to all heads of a particular segment (including the root and auxiliary tokens). Darker colors indicate higher scores. The input to Dozat and Manning's original architecture consists of the root and gray-marked segments.
  • Figure 3: The full architecture, illustrated on the phrase 'bbit hlbn' (in-the-house the-white). It covers morphological analysis, generating context for each analysis, acquiring contextualized embeddings, constructing a dependency tree, and predicting linguistic features.
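
The Head matrix in Figure 2 can be read as follows: each row belongs to one candidate segment, each column to one candidate head, and the highest-scoring column in a row is that segment's preferred head. The Python sketch below illustrates this reading only. The labels, the scores, and the `pick_heads` helper are made up for illustration, and the greedy per-row choice is an assumption: it is not the paper's decoding procedure and does not guarantee a well-formed tree.

```python
def pick_heads(scores):
    """For each row (dependent), return the index of its best-scoring head.

    Row 0 is the artificial root, which has no head, so it maps to None.
    A segment is never allowed to be its own head.
    """
    heads = [None]
    for dep in range(1, len(scores)):
        row = scores[dep]
        candidates = [j for j in range(len(row)) if j != dep]
        heads.append(max(candidates, key=lambda j: row[j]))
    return heads


# Hypothetical labels and scores; higher means "more likely head".
LABELS = ["ROOT", "bkrti", "b", "bit"]
SCORES = [
    [0.0, 0.0, 0.0, 0.0],    # ROOT row is unused
    [0.9, 0.0, 0.05, 0.05],  # bkrti prefers ROOT
    [0.1, 0.1, 0.0, 0.8],    # b prefers bit
    [0.1, 0.7, 0.1, 0.1],    # bit prefers bkrti
]

HEADS = pick_heads(SCORES)
for dep, head in enumerate(HEADS):
    if head is not None:
        print(f"{LABELS[dep]} <- {LABELS[head]}")
```

In an arc-factored model the score of a tree is the sum of its individual arc scores, which is why a single matrix of this shape is enough to compare candidate trees.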