Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models

Takashi Wada, Tomoharu Iwata

TL;DR

This paper tackles unsupervised cross-lingual word embedding without parallel data. It introduces a multilingual neural language model that shares bidirectional LSTMs across languages while keeping word embeddings and output projections language-specific. Jointly training the forward and backward language modeling objectives on all languages places the embeddings in a common latent space, which supports word alignment even when monolingual data is scarce or comes from different domains. On bilingual word alignment, the model outperforms existing unsupervised methods in these low-resource and mismatched-domain settings, and it also produces quadrilingual embeddings for French, English, German, and Spanish. The work suggests a practical path toward cross-lingual representations for under-resourced languages and points to semi-supervised extensions using bilingual dictionaries.

Abstract

We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call a multilingual neural language model, takes sentences in multiple languages as input. It contains bidirectional LSTMs that act as forward and backward language models, and these networks are shared among all the languages. The other parameters, i.e. the word embeddings and the linear transformation between hidden states and outputs, are specific to each language. The shared LSTMs can capture the common sentence structure among all languages. Accordingly, word embeddings of each language are mapped into a common latent space, making it possible to measure the similarity of words across multiple languages. We evaluate the quality of the cross-lingual word embeddings on a word alignment task. Our experiments demonstrate that our model can obtain cross-lingual embeddings of much higher quality than existing unsupervised models when only a small amount of monolingual data (i.e. 50k sentences) is available, or when the domains of the monolingual data differ across languages.
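
As a rough sketch of the joint training described above, the objective can be written as a sum of forward and backward language-modeling losses over all languages. The symbols $\overrightarrow{f}$, $\overleftarrow{f}$, $E^{\ell}$, and $W^{\ell}$ follow the Figure 1 caption; the corpus symbol $D^{\ell}$, the sentence length $N$, and the exact form of the sum are assumptions made here for illustration and may differ from the paper's own equations. The shared $<$BOS$>$ and $<$EOS$>$ parameters are omitted for brevity.

$$
\mathcal{L} = -\sum_{\ell} \sum_{(w_1,\dots,w_N) \in D^{\ell}} \sum_{t=1}^{N} \Big[ \log p\big(w_t \mid w_1,\dots,w_{t-1};\, \overrightarrow{f}, E^{\ell}, W^{\ell}\big) + \log p\big(w_t \mid w_{t+1},\dots,w_N;\, \overleftarrow{f}, E^{\ell}, W^{\ell}\big) \Big]
$$

In this form, $\overrightarrow{f}$ and $\overleftarrow{f}$ appear in every language's term, while $E^{\ell}$ and $W^{\ell}$ appear only in the term for language $\ell$. Updates from all languages therefore pass through the same LSTMs, which is what lets the language-specific embeddings settle into a common space.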

Paper Structure

This paper contains 18 sections, 12 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of our proposed multilingual neural language model. The parameters shared across languages are those of the forward and backward LSTMs, $\overrightarrow{f}$ and $\overleftarrow{f}$, the embedding of $<$BOS$>$, $E^{\rm BOS}$, and the linear projection for $<$EOS$>$, $W^{\rm EOS}$. In contrast, the word embeddings $E^{\ell}$ and the linear projection $W^{\ell}$ are specific to each language $\ell$. The shared LSTMs capture a structure common to multiple languages, which enables us to map the word embeddings $E^{\ell}$ of multiple languages into a common space.
  • Figure 2: Comparison of p@1 accuracy on the German-English pair between a supervised word mapping method and our model, both trained on 50k sentences. The x-axis indicates the number of word pairs $n$ (= 0, 50, 100, 150, ..., 450, 500) that were used by the supervised method, but not by ours, to map the word embedding spaces of the two languages.
  • Figure 3: Change in p@1 accuracy for each language pair as the size of the training data increases. The x-axis denotes the number of sentences (in thousands) in the monolingual training data of the source and target languages.
  • Figure 4: Scatter plot of cross-lingual word embeddings of French, English, German and Spanish obtained by our model. The embeddings are reduced to 2D using t-SNE.
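
The p@1 accuracy reported in Figures 2 and 3 can be illustrated with a small sketch: for each source word, retrieve the nearest target word in the common space and check whether it is a gold translation. The use of cosine similarity, the function names, and the toy German-English vectors below are all assumptions made for illustration; they are not taken from the paper or its evaluation data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_target(src_vec, tgt_emb):
    """Return the target word whose embedding is most similar to src_vec."""
    return max(tgt_emb, key=lambda word: cosine(src_vec, tgt_emb[word]))


def precision_at_1(src_emb, tgt_emb, gold):
    """Fraction of source words whose nearest target word is a gold translation."""
    hits = 0
    total = 0
    for src_word, translations in gold.items():
        if src_word not in src_emb:
            continue  # skip dictionary entries that have no embedding
        total += 1
        if nearest_target(src_emb[src_word], tgt_emb) in translations:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical 2D embeddings that are assumed to already share one space.
src_emb = {
    "hund": [0.9, 0.1],
    "katze": [0.1, 0.9],
    "haus": [0.7, 0.7],
    "baum": [0.5, 0.5],
}
tgt_emb = {
    "dog": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "house": [0.6, 0.8],
    "tree": [-1.0, 0.2],
}
# Hypothetical gold dictionary: source word -> set of accepted translations.
gold = {"hund": {"dog"}, "katze": {"cat"}, "haus": {"house"}, "baum": {"tree"}}

score = precision_at_1(src_emb, tgt_emb, gold)
print(f"p@1 = {score:.2f}")
```

In this toy setup three of the four source words retrieve a gold translation, while "baum" retrieves "house" instead of "tree", so the score is 0.75. A real evaluation would use the learned embeddings $E^{\ell}$ and a held-out bilingual dictionary.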