LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung

TL;DR

LangAlign addresses the data-cost barrier in non-English NLP by introducing a lightweight bridging layer that maps English embeddings to target-language embeddings at the LM–task head interface, formalized as $\mathcal{A}(e(d_e)) \approx e(\mathcal{T}(d_e))$. The authors explore two architectures, Fully Connected and AutoEncoder LangAlign, and a three-step training pipeline: optional embedding tuning, LangAlign training with an $L_2$ (MSE) objective while the LM is kept frozen, and task-specific fine-tuning on English data. Across Korean, Japanese, and Chinese, LangAlign consistently beats English-data baselines and often matches or surpasses native-data or MT-based baselines while reducing data-collection costs. A reverse variant (Rev-LangAlign) shows potential for mapping non-English data into the embedding space of an English model, which enables transfer inference. Overall, LangAlign offers a data-efficient approach to cross-lingual embedding alignment with practical value for embedding-based NLP services.
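
The alignment objective named above can be written out explicitly. The following is a sketch under assumptions rather than the paper's own equation: it assumes $N$ paired examples $(d_e^{(i)}, \mathcal{T}(d_e^{(i)}))$ with the same meaning, a frozen LM embedding function $e$, and an alignment layer $\mathcal{A}_\theta$ with trainable parameters $\theta$. The index $i$, the subscript $\theta$, and the $1/N$ normalization are introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N}
    \left\lVert
      \mathcal{A}_\theta\!\left(e\!\left(d_e^{(i)}\right)\right)
      - e\!\left(\mathcal{T}\!\left(d_e^{(i)}\right)\right)
    \right\rVert_2^2
```

Minimizing $\mathcal{L}(\theta)$ over $\theta$ alone, with $e$ fixed, drives $\mathcal{A}_\theta(e(d_e))$ toward $e(\mathcal{T}(d_e))$, which is the approximate equality stated in the TL;DR.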

Abstract

While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.

Paper Structure

This paper contains 77 sections, 1 equation, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Task fine-tuning sequence using LangAlign. Learning parts are highlighted in yellow, and frozen parts are shown in gray. Step 1. Initial tuning phase where the LM is fine-tuned to generate task-specific embeddings. This step is optional. Step 2. Training LangAlign to transform $e(d_e)$ into $e(\mathcal{T}(d_e))$. $d_e$ and $\mathcal{T}(d_e)$ carry the same meaning. Step 3. Fine-tuning the task using the trained LangAlign. The model is trained with English data.
  • Figure 2: Ablation study results. (a) Performance of models with the LangAlign layer removed, (b) Performance of models using the same data for both LangAlign training and task-specific tuning.
  • Figure 3: Verification of LangAlign’s embedding transformation capability using cosine similarity
  • Figure 4: Language generalizability evaluation results
  • Figure 5: Layer structure of LangAlign
  • ...and 1 more figure
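
Step 2 of Figure 1 and the cosine-similarity check of Figure 3 can be illustrated with a small NumPy sketch. It is not the authors' implementation. The embeddings are synthetic stand-ins for frozen LM outputs, the alignment layer is a single fully connected layer, and every name (`eng`, `tgt`, `true_map`, `mean_cosine`) and hyperparameter is a hypothetical choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): rows of `eng` play the role of e(d_e),
# rows of `tgt` play the role of e(T(d_e)). The LM is treated as frozen,
# so these arrays never change during training.
n, dim = 256, 16
eng = rng.normal(size=(n, dim))
true_map = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # hidden relation, illustration only
tgt = eng @ true_map + 0.01 * rng.normal(size=(n, dim))


def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))


def mean_cosine(a, b):
    """Average row-wise cosine similarity between two embedding matrices."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


# Alignment layer A(x) = x W + b, initialised as the identity map.
W = np.eye(dim)
b = np.zeros(dim)

loss_before = mse(eng @ W + b, tgt)
cos_before = mean_cosine(eng @ W + b, tgt)

# Plain gradient descent on the MSE objective; only W and b are updated.
lr = 1.0
for _ in range(300):
    err = eng @ W + b - tgt
    grad_W = 2.0 * eng.T @ err / err.size
    grad_b = 2.0 * err.sum(axis=0) / err.size
    W -= lr * grad_W
    b -= lr * grad_b

loss_after = mse(eng @ W + b, tgt)
cos_after = mean_cosine(eng @ W + b, tgt)

print(f"MSE:    {loss_before:.4f} -> {loss_after:.6f}")
print(f"cosine: {cos_before:.4f} -> {cos_after:.4f}")
```

In Step 3 the trained layer would be placed between the LM and the task head, so that a head trained on aligned English embeddings sees inputs resembling target-language embeddings. A higher average cosine similarity after training is the kind of evidence Figure 3 reports, though the figures in this sketch come from synthetic data and say nothing about the paper's results.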