Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, Isabel Trancoso
TL;DR
Traditional word lookup tables store an independent vector for every word type, which is costly in parameters and cannot represent words unseen in training. The paper addresses both limits with the C2W model, which composes word vectors from character sequences using bidirectional LSTMs. This approach yields open-vocabulary word representations with substantially fewer parameters, capturing both regular morphological patterns and non-compositional form-function relations. Empirically, C2W improves language modeling across multiple languages (most notably Turkish) and achieves state-of-the-art or near state-of-the-art results in POS tagging without hand-crafted features. The results indicate that character-level composition can learn rich lexical features automatically, and that caching composed word vectors and sharing character representations across words keep it efficient, which is especially useful for morphologically rich languages.
Abstract
We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs. Relative to traditional word representation models that have independent vectors for each word type, our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form-function relationship in language, our "composed" word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages (e.g., Turkish).
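
The composition described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes a plain LSTM cell (the paper's exact recurrent unit may differ), toy dimensions, random untrained weights, and a lowercase ASCII alphabet. The class and method names (`LSTM`, `C2W`, `embed`, `num_parameters`) are hypothetical. The sketch shows one vector per character type, a forward and a backward pass over the characters, and a linear combination of the two final states into a word vector.

```python
import math
import random
import string

random.seed(0)
CHAR_DIM, STATE_DIM, WORD_DIM = 4, 6, 5  # illustrative sizes, not taken from the paper


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTM:
    """Plain LSTM over a sequence of vectors (assumed simplification)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, in_dim, hid_dim):
        self.hid_dim = hid_dim
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.W = {g: rand_matrix(hid_dim, in_dim + hid_dim) for g in self.GATES}
        self.b = {g: [0.0] * hid_dim for g in self.GATES}

    def run(self, inputs):
        """Return the final hidden state after reading all inputs."""
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        for x in inputs:
            xh = x + h  # list concatenation: [x; h]
            pre = {g: [a + b for a, b in zip(matvec(self.W[g], xh), self.b[g])]
                   for g in self.GATES}
            i = [sigmoid(v) for v in pre["input"]]
            f = [sigmoid(v) for v in pre["forget"]]
            o = [sigmoid(v) for v in pre["output"]]
            cand = [math.tanh(v) for v in pre["candidate"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


class C2W:
    """Composes a word vector from characters with a bidirectional LSTM."""

    def __init__(self, alphabet):
        # A single vector per character type, shared by all words.
        self.char_vecs = {ch: [random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
                          for ch in alphabet}
        self.fwd = LSTM(CHAR_DIM, STATE_DIM)
        self.bwd = LSTM(CHAR_DIM, STATE_DIM)
        self.D_f = rand_matrix(WORD_DIM, STATE_DIM)
        self.D_b = rand_matrix(WORD_DIM, STATE_DIM)
        self.bias = [0.0] * WORD_DIM
        self._cache = {}  # only valid while the parameters stay fixed

    def embed(self, word):
        if word not in self._cache:
            chars = [self.char_vecs[ch] for ch in word]
            s_f = self.fwd.run(chars)        # left-to-right final state
            s_b = self.bwd.run(chars[::-1])  # right-to-left final state
            self._cache[word] = [
                a + b + c
                for a, b, c in zip(matvec(self.D_f, s_f), matvec(self.D_b, s_b), self.bias)
            ]
        return self._cache[word]

    def num_parameters(self):
        lstm = 4 * (STATE_DIM * (CHAR_DIM + STATE_DIM) + STATE_DIM)
        return (len(self.char_vecs) * CHAR_DIM + 2 * lstm
                + 2 * WORD_DIM * STATE_DIM + WORD_DIM)


model = C2W(string.ascii_lowercase)
vec = model.embed("unfriendliness")  # any string over the alphabet gets a vector
print(len(vec), model.num_parameters())
```

In this sketch the parameter count depends on the alphabet size and the chosen dimensions, not on how many distinct words are embedded, which is the compactness property the abstract describes. The cache reflects the idea that a word's composed vector can be reused once the parameters are fixed.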
