Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, Isabel Trancoso

TL;DR

The paper tackles the cost and generalization limits of traditional word lookup embeddings by introducing the C2W model, which composes word vectors from character sequences using bidirectional LSTMs. This approach yields open-vocabulary word representations with substantially fewer parameters, capturing both regular morphological patterns and non-compositional form-function relations. Empirically, C2W improves language modeling across multiple languages (most notably Turkish) and achieves state-of-the-art or near state-of-the-art results in POS tagging without hand-crafted features. The results demonstrate that character-level composition can automatically learn rich lexical features while remaining efficient through caching and shared character representations, making it highly scalable for morphologically rich languages.
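The composition step summarized above can be illustrated with a minimal, untrained sketch in plain Python. It is not the authors' implementation: the class names `LSTMCell` and `C2W`, the dimensions, the random initialization, and the plain (non-peephole) LSTM cell are all assumptions made for illustration. The sketch assumes the final forward and backward LSTM states are combined by a linear projection, and it caches composed vectors per word type, which is only valid while the parameters stay fixed (for example at test time).

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (illustrative initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Plain LSTM cell; a simplification assumed for this sketch."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate, applied to [input; hidden].
        self.w = {g: make_matrix(hidden_dim, input_dim + hidden_dim, rng) for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation: [input; previous hidden state]
        pre = {g: [a + b for a, b in zip(matvec(self.w[g], xh), self.b[g])] for g in "ifoc"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["c"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, inputs):
        """Return the hidden state after reading the whole sequence."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            h, c = self.step(x, h, c)
        return h


class C2W:
    """Hypothetical character-to-word composer with a per-word cache."""

    def __init__(self, alphabet, char_dim=4, hidden_dim=5, word_dim=3, seed=0):
        rng = random.Random(seed)
        # A single vector per character type, shared by every word.
        self.char_vecs = {ch: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)] for ch in alphabet}
        self.fwd = LSTMCell(char_dim, hidden_dim, rng)
        self.bwd = LSTMCell(char_dim, hidden_dim, rng)
        self.d_f = make_matrix(word_dim, hidden_dim, rng)
        self.d_b = make_matrix(word_dim, hidden_dim, rng)
        self.bias = [0.0] * word_dim
        self.cache = {}

    def embed(self, word):
        if word not in self.cache:
            chars = [self.char_vecs[ch] for ch in word]
            s_f = self.fwd.run(chars)        # left-to-right pass
            s_b = self.bwd.run(chars[::-1])  # right-to-left pass
            self.cache[word] = [
                a + b + c
                for a, b, c in zip(matvec(self.d_f, s_f), matvec(self.d_b, s_b), self.bias)
            ]
        return self.cache[word]


model = C2W("abcdefghijklmnopqrstuvwxyz")
print(model.embed("cat"))
print(model.embed("cats"))  # an unseen word still gets a vector
```

Because any string over the alphabet can be passed to `embed`, the vocabulary is open; the cache simply avoids rerunning the two LSTM passes for word types that have already been composed.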

Abstract

We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs. Relative to traditional word representation models that have independent vectors for each word type, our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form-function relationship in language, our "composed" word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages (e.g., Turkish).
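The compactness claim can be made concrete with a rough parameter count. The sketch below compares a word lookup table against a character-composed model; every size in it (vocabulary sizes, 100 character types, dimensions of 50 and 150) is an illustrative assumption rather than a figure taken from this summary, and the LSTM count assumes a plain four-gate cell with biases.

```python
def lookup_table_params(vocab_size, word_dim):
    """One independent vector per word type."""
    return vocab_size * word_dim


def lstm_params(input_dim, hidden_dim):
    """Four gates, each with weights over [input; hidden] plus a bias."""
    return 4 * hidden_dim * (input_dim + hidden_dim + 1)


def c2w_params(num_chars, char_dim, hidden_dim, word_dim):
    """Character table + forward and backward LSTMs + output projection."""
    char_table = num_chars * char_dim
    bilstm = 2 * lstm_params(char_dim, hidden_dim)
    projection = 2 * hidden_dim * word_dim + word_dim
    return char_table + bilstm + projection


# Illustrative sizes only (assumptions, not values reported in the paper).
composed = c2w_params(num_chars=100, char_dim=50, hidden_dim=150, word_dim=50)
for vocab_size in (10_000, 100_000, 1_000_000):
    table = lookup_table_params(vocab_size, word_dim=50)
    print(f"vocab={vocab_size:>9,}  lookup={table:>11,}  composed={composed:>9,}")
```

Under these assumptions the composed model stays at 261,250 parameters whatever the vocabulary size, while the lookup table grows linearly and already needs 5,000,000 parameters for 100,000 word types.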

Paper Structure

This paper contains 26 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of the word lookup tables (top) and the lexical Composition Model (bottom). Square boxes represent vectors of neuron activations. Shaded boxes indicate a non-linearity.
  • Figure 2: Illustration of our neural network for Language Modeling.
  • Figure 3: Illustration of our neural network for POS tagging.