Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs
Miguel Ballesteros, Chris Dyer, Noah A. Smith
TL;DR
The paper tackles dependency parsing in morphologically rich languages by replacing word-lookup embeddings with character-based encodings learned via bidirectional LSTMs. It builds on a high-performance continuous-state, transition-based parser that uses stack LSTMs, and adds a swap operation to support nonprojective trees, enabling morphology-aware parsing without explicit morphological annotations. Empirical results on the SPMRL languages show substantial gains, especially for agglutinative languages and out-of-vocabulary (OOV) words, with the Char+POS configuration often achieving the best labeled attachment score (LAS). The findings suggest that morphology can be learned from orthography, which reduces reliance on manual morphological features and highlights the potential of character-based representations for robust parsing across diverse languages.
Abstract
We present extensions to a continuous-state dependency parsing method that make it applicable to morphologically rich languages. Starting with a high-performance transition-based parser that uses long short-term memory (LSTM) recurrent neural networks to learn representations of the parser state, we replace lookup-based word representations with representations constructed from the orthographic representations of the words, also using LSTMs. This allows statistical sharing across word forms that are similar on the surface. Experiments for morphologically rich languages show that the parsing model benefits from incorporating the character-based encodings of words.
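To make the idea of building a word representation from its characters concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It reads a word's characters with one LSTM left to right and another right to left, then concatenates the two final hidden states. The names `TinyLSTM` and `CharWordEncoder`, the dimensions, and the random untrained weights are all assumptions made for illustration; the sketch shows only the data flow, and any further learned transformation or POS input used in the paper is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Hypothetical single-layer LSTM with random, untrained weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim + 1  # input, previous hidden state, bias
        # One weight matrix per gate: input, forget, output, candidate.
        self.W = [
            [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
             for _ in range(hidden_dim)]
            for _ in range(4)
        ]

    def run(self, inputs):
        """Consume a sequence of vectors and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            z = x + h + [1.0]  # list concatenation: [x; h; 1]
            pre = [[sum(w * v for w, v in zip(row, z)) for row in gate]
                   for gate in self.W]
            i = [sigmoid(a) for a in pre[0]]
            f = [sigmoid(a) for a in pre[1]]
            o = [sigmoid(a) for a in pre[2]]
            g = [math.tanh(a) for a in pre[3]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


class CharWordEncoder:
    """Hypothetical encoder: word vector = [forward final; backward final]."""

    def __init__(self, alphabet, char_dim=8, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.char_emb = {
            ch: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
            for ch in alphabet
        }
        self.unk = [0.0] * char_dim  # fallback for characters not in alphabet
        self.fwd = TinyLSTM(char_dim, hidden_dim, rng)
        self.bwd = TinyLSTM(char_dim, hidden_dim, rng)

    def encode(self, word):
        chars = [self.char_emb.get(ch, self.unk) for ch in word]
        return self.fwd.run(chars) + self.bwd.run(chars[::-1])


encoder = CharWordEncoder("abcdefghijklmnopqrstuvwxyz")
vec_parsing = encoder.encode("parsing")
vec_parsed = encoder.encode("parsed")
```

Because the encoder consumes characters rather than looking up a word identifier, any string over the alphabet receives a vector, including word forms never seen in training. This is the property that allows statistical sharing across forms that are similar on the surface. In a real parser the weights would be trained jointly with the rest of the model.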
