Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler

TL;DR

This work proposes a transformer-based approach that computes word vectors directly from character strings, integrating both semantic and syntactic information. The authors call this approach Rich Character Embeddings (RCE).

Abstract

Tokenization- and sub-tokenization-based models like word2vec, BERT, and the GPTs are the state of the art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resourced languages. To mitigate this problem, we propose to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. They have the potential to improve performance for both large context-based language models like BERT and small models like word2vec for under-resourced and morphologically rich languages. We evaluate our approach on various tasks, such as SWAG, declension prediction for inflected languages, and metaphor and chiasmus detection for various languages. Our experiments show that it outperforms traditional token-based approaches on limited data using OddOneOut and TopK metrics.

Paper Structure (28 sections, 3 figures, 8 tables)

This paper contains 28 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: An example of potential problems with subtokenization-based embeddings and our solution. We see the words first in subtokenized form and then in our novel approach. The tokens that qualification is split into are distinct from the token for quality. Also, one word is split into three vectors. In contrast, our novel approach computes a single vector representation for each word and takes the spelling similarity into account.
  • Figure 2: A simple example sentence in various Germanic languages. The similarities between the languages are obvious. However, even the similar words would be treated as completely distinct tokens in a dictionary-based approach like WordPiece, without any information about their similarity in the input.
  • Figure 3: This figure shows the model architecture for the Rich Character Embedding. The input token, in this case the word Token, is represented in the input as its character string, with the capital T split up into an [UP] modifier and the base character t. The character string is then transformed into a vector representation by the transformer encoder. The output of the encoder is then used as the word embedding.
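
The input representation described in the Figure 3 caption can be illustrated with a minimal sketch. It covers only the preprocessing step, in which a word becomes a character sequence and each capital letter is split into an [UP] modifier followed by its lowercase base character. The function name `to_character_tokens` and the handling of non-letter characters are assumptions made for illustration, not the authors' implementation, and the transformer encoder that consumes this sequence is omitted.

```python
# Hypothetical sketch of the character-level input described in Figure 3.
# Only the [UP] modifier is taken from the caption; everything else
# (function name, treatment of non-letters) is an illustrative assumption.

UP = "[UP]"


def to_character_tokens(word: str) -> list[str]:
    """Split a word into characters, marking capitals with an [UP] modifier."""
    tokens: list[str] = []
    for ch in word:
        lower = ch.lower()
        # Treat a character as a capital only if lowercasing changes it
        # and yields exactly one character.
        if ch.isupper() and lower != ch and len(lower) == 1:
            tokens.append(UP)
            tokens.append(lower)
        else:
            tokens.append(ch)
    return tokens


if __name__ == "__main__":
    print(to_character_tokens("Token"))
    print(to_character_tokens("quality"))
    print(to_character_tokens("qualification"))
```

With this representation, quality and qualification share their first five characters in the input, so the spelling similarity mentioned in the Figure 1 caption is visible to the encoder rather than hidden behind unrelated subtoken IDs.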