Language Independent Named Entity Recognition via Orthogonal Transformation of Word Vectors

Omar E. Rakha, Hazem M. Abbas

TL;DR

This paper tackles cross-lingual named entity recognition by decoupling the training language from the deployment language through orthogonal embedding alignment. A Bi-LSTM/CRF NER model is trained on English, while target-language word vectors are mapped into the English embedding space using an orthogonal transformation learned via SVD on a word-alignment dictionary. The experiments show that, without any Arabic training data, Arabic NER becomes feasible after alignment, with performance improving substantially across entity types. This suggests the approach can extend to many languages, given corresponding embedding mappings. The method offers a simple, efficient pathway toward language-agnostic NLP systems, supported by publicly available transformation matrices covering multiple languages. This has practical implications for multilingual information extraction and SEO-enabled search systems that must operate with minimal labeled data per language.

Abstract

Word embeddings are a key building block in NLP, and models for many different tasks rely heavily on them. This paper proposes a model that uses a Bidirectional LSTM/CRF with word embeddings to perform named entity recognition for any language. The model is trained on a source language (English), and word embeddings from the target language are transformed into the source-language embedding space by an orthogonal linear transformation matrix. Evaluation shows that a model trained on an English dataset was capable of detecting named entities in an Arabic dataset without either training or fine-tuning on an Arabic dataset.
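
The orthogonal mapping described above can be illustrated with a minimal sketch of the standard orthogonal Procrustes solution, in which the transformation is obtained from an SVD of the cross-covariance between dictionary-aligned word vectors. This is not the authors' code: the function name `learn_orthogonal_map`, the row normalisation step, and the synthetic "Arabic-like" and "English-like" vectors are all assumptions made for illustration.

```python
import numpy as np


def learn_orthogonal_map(target_vecs, source_vecs):
    """Return an orthogonal W minimising ||target_vecs @ W - source_vecs||_F.

    Row i of each matrix holds the embedding of the i-th dictionary pair
    (hypothetical layout, chosen for this sketch).
    """
    # Assumption: unit-normalise rows so every word pair contributes equally.
    target = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    source = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    # Orthogonal Procrustes: SVD of the cross-covariance, singular values dropped.
    u, _, vt = np.linalg.svd(target.T @ source)
    return u @ vt


# Synthetic stand-in for a bilingual dictionary: the "English-like" vectors
# are an exact rotation of the "Arabic-like" vectors.
rng = np.random.default_rng(0)
dim, n_pairs = 5, 200
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
arabic_like = rng.normal(size=(n_pairs, dim))
english_like = arabic_like @ true_rotation

w = learn_orthogonal_map(arabic_like, english_like)
aligned = arabic_like @ w  # target vectors mapped into the source space
```

Under these assumptions, an English-trained tagger would receive `aligned` rows in place of English embeddings at inference time. Because `w` is orthogonal, the mapping preserves vector lengths and angles, so the relative geometry of the target-language vectors is unchanged.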

Paper Structure

This paper contains 12 sections, 19 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: An NER Example
  • Figure 2: The IOB format
  • Figure 3: A language independent model
  • Figure 4: The Multi Task Cross Lingual Training model
  • Figure 5: Architecture of the proposed model
  • ...and 15 more figures