Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs

Miguel Ballesteros, Chris Dyer, Noah A. Smith

TL;DR

The paper tackles dependency parsing in morphologically rich languages by replacing word-lookup embeddings with character-based encodings learned via bidirectional LSTMs. It extends a high-performance continuous-state, transition-based parser with stack LSTMs and a swap operation to support nonprojective trees, enabling morphology-aware parsing without explicit morphological annotations. Empirical results on SPMRL languages show substantial gains, especially for agglutinative languages and OOV words, with Char+POS often achieving the best LAS. The findings suggest morphology can be learned from orthography, reducing reliance on manual morphological features and highlighting the potential of character-based representations for robust parsing across diverse languages.

Abstract

We present extensions to a continuous-state dependency parsing method that make it applicable to morphologically rich languages. Starting with a high-performance transition-based parser that uses long short-term memory (LSTM) recurrent neural networks to learn representations of the parser state, we replace lookup-based word representations with representations constructed from the orthographic representations of the words, also using LSTMs. This allows statistical sharing across word forms that are similar on the surface. Experiments for morphologically rich languages show that the parsing model benefits from incorporating the character-based encodings of words.
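To make the idea of building a word vector from its spelling concrete, here is a minimal, untrained sketch in plain Python. It reads a word's characters with one forward and one backward LSTM and concatenates the two final hidden states. The class name `CharWordEncoder`, the boundary symbols `<w>` and `</w>`, the `<unk>` fallback, the dimensions, and the simple concatenation are all assumptions made for illustration; the paper's exact parameterization may differ, and the weights here are random rather than learned.

```python
import math
import random


def make_lstm(input_dim, hidden_dim, rng):
    """Random (untrained) LSTM parameters: one weight matrix and bias per gate."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: (matrix(), [0.0] * hidden_dim) for gate in ("i", "f", "o", "g")}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(params, x, h, c):
    """One standard LSTM update on input x with previous state (h, c)."""
    xh = x + h  # list concatenation: [x; h]

    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(weights, bias)]

    i = [sigmoid(z) for z in affine("i")]    # input gate
    f = [sigmoid(z) for z in affine("f")]    # forget gate
    o = [sigmoid(z) for z in affine("o")]    # output gate
    g = [math.tanh(z) for z in affine("g")]  # candidate cell values
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def run_lstm(params, inputs, hidden_dim):
    """Run the LSTM over a sequence and return the final hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
    return h


class CharWordEncoder:
    """Hypothetical encoder: word vector = [forward final state; backward final state]."""

    def __init__(self, alphabet, char_dim=8, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        symbols = ["<w>", "</w>", "<unk>"] + sorted(set(alphabet))
        self.char_emb = {s: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
                         for s in symbols}
        self.fwd = make_lstm(char_dim, hidden_dim, rng)
        self.bwd = make_lstm(char_dim, hidden_dim, rng)

    def encode(self, word):
        chars = ["<w>"] + list(word) + ["</w>"]
        vecs = [self.char_emb.get(ch, self.char_emb["<unk>"]) for ch in chars]
        h_forward = run_lstm(self.fwd, vecs, self.hidden_dim)
        h_backward = run_lstm(self.bwd, vecs[::-1], self.hidden_dim)
        return h_forward + h_backward


encoder = CharWordEncoder("abcdefghijklmnopqrstuvwxyz")
party = encoder.encode("party")
print(len(party))
```

Because every word passes through the same character-level parameters, an out-of-vocabulary word still receives its own vector, and words with shared affixes share computation. That is the kind of statistical sharing across surface-similar forms the abstract refers to.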

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Parser transitions indicating the action applied to the stack and buffer and the resulting stack and buffer states. Bold symbols indicate (learned) embeddings of words and relations; script symbols indicate the corresponding words and relations. The original continuous-state parser used the shift and reduce operations; we add swap.
  • Figure 2: Baseline model word embeddings for an in-vocabulary word that is tagged with POS tag NN (right) and an out-of-vocabulary word with POS tag JJ (left).
  • Figure 3: Character-based word embedding of the word party. This representation is used for both in-vocabulary and out-of-vocabulary words.
  • Figure 4: Character-based word representations of 30 random words from the English development set (Chars). Dots in red represent past tense verbs; dots in orange represent gerund verbs; dots in black represent present tense verbs; dots in blue represent adjectives; dots in green represent adverbs; dots in yellow represent singular nouns; dots in brown represent plural nouns. The visualization was produced using t-SNE; see http://lvdmaaten.github.io/tsne/.
  • Figure 5: On the $x$-axis is the OOV rate in development data, by treebank; on the $y$-axis is the difference in development-set LAS between the Chars model (the character-based word representation model) and one in which all OOV words are given a single representation.
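
The transitions summarized in the Figure 1 caption (shift, reduce, and the added swap) can be sketched symbolically on token indices. This is a minimal illustration, not the paper's neural parser: it tracks only the stack, buffer, and arcs, with no learned embeddings or scoring. The function names, the `(head, label, dependent)` arc format, the placeholder label `dep`, and the ordering check inside `swap` (taken from the usual arc-standard-with-swap formulation) are assumptions.

```python
def shift(stack, buffer, arcs):
    """Move the first buffer item onto the stack."""
    stack.append(buffer.pop(0))


def reduce_left(stack, buffer, arcs, label):
    """Top of stack becomes head of the item below it; the dependent is removed."""
    dependent = stack.pop(-2)
    arcs.append((stack[-1], label, dependent))


def reduce_right(stack, buffer, arcs, label):
    """Second-from-top becomes head of the top item; the dependent is removed."""
    dependent = stack.pop()
    arcs.append((stack[-1], label, dependent))


def swap(stack, buffer, arcs):
    """Move the second-from-top stack item back to the front of the buffer.

    Assumed precondition: the two items are still in their original order,
    which prevents swapping the same pair back and forth indefinitely.
    """
    if stack[-2] >= stack[-1]:
        raise ValueError("swap requires the two top items in original order")
    buffer.insert(0, stack.pop(-2))


def run(tokens, actions):
    """Apply a sequence of (function, *args) actions to an initial configuration."""
    stack, buffer, arcs = [], list(tokens), []
    for action, *args in actions:
        action(stack, buffer, arcs, *args)
    return stack, buffer, arcs


# A four-token tree whose arcs 1 -> 3 and 2 -> 4 cross, so it is nonprojective.
actions = [
    (shift,), (shift,), (shift,),
    (swap,),                 # send token 2 back so that 1 and 3 become adjacent
    (reduce_right, "dep"),   # 1 -> 3
    (shift,),
    (reduce_left, "dep"),    # 2 -> 1
    (shift,),
    (reduce_right, "dep"),   # 2 -> 4
]
stack, buffer, arcs = run([1, 2, 3, 4], actions)
print(stack, buffer, arcs)
```

Without `swap`, tokens 1 and 3 could never sit next to each other on the stack while token 2 is still unattached, so the crossing arc could not be built. This is why adding the operation lets the parser produce nonprojective trees.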