Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment

Zhongtao Miao; Qiyu Wu; Kaiyan Zhao; Zilong Wu; Yoshimasa Tsuruoka

TL;DR

This work addresses the scarcity of parallel data for low-resource languages by showing that word representations in such languages are under-aligned with those in high-resource languages in current cross-lingual models. It introduces WACSE, a framework that leverages explicit word alignment supervision from a pre-existing aligner and optimizes three objectives: Aligned Word Prediction ($\mathcal{L}^{AWP}$), Word Translation Ranking ($\mathcal{L}^{WTR}$), and Translation Ranking ($\mathcal{L}^{TR}$). The final loss is $\mathcal{L} = \alpha\mathcal{L}^{TR} + \beta\mathcal{L}^{AWP} + \gamma\mathcal{L}^{WTR}$. Experiments on bitext retrieval, cross-lingual STS, bitext mining, and NLI demonstrate that WACSE yields substantial gains for low-resource languages while maintaining competitive performance on high-resource languages. The results imply that explicit word-level alignment can meaningfully augment cross-lingual sentence embeddings in data-scarce scenarios, with future work aimed at phrase-level alignment and stronger word-alignment models.
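The final loss in the TL;DR is a weighted sum of the three objectives. A minimal sketch of that combination follows. The function name `combined_loss` and the default weights of 1.0 are assumptions made for illustration; this summary does not state the values of $\alpha$, $\beta$, and $\gamma$ used in the paper.

```python
def combined_loss(l_tr, l_awp, l_wtr, alpha=1.0, beta=1.0, gamma=1.0):
    """Return alpha * L_TR + beta * L_AWP + gamma * L_WTR.

    l_tr, l_awp, l_wtr are the scalar values of the translation ranking,
    aligned word prediction, and word translation ranking losses.
    The default weights are placeholders, not values from the paper.
    """
    return alpha * l_tr + beta * l_awp + gamma * l_wtr


# Example with made-up loss values and weights, for illustration only.
total = combined_loss(0.8, 2.0, 1.5, alpha=1.0, beta=0.5, gamma=0.25)
```

Setting $\beta = \gamma = 0$ in this sketch leaves only the sentence-level translation ranking term, which is one way to see the word-level objectives as additions to an otherwise standard setup.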

Abstract

The field of cross-lingual sentence embeddings has recently experienced significant advancements, but research concerning low-resource languages has lagged due to the scarcity of parallel corpora. This paper shows that cross-lingual word representation in low-resource languages is notably under-aligned with that in high-resource languages in current models. To address this, we introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models. This framework incorporates three primary training objectives: aligned word prediction and word translation ranking, along with the widely used translation ranking. We evaluate our approach through experiments on the bitext retrieval task, which demonstrate substantial improvements on sentence embeddings in low-resource languages. In addition, the competitive performance of the proposed model across a broader range of tasks in high-resource languages underscores its practicality.

Paper Structure (31 sections, 11 equations, 2 figures, 10 tables)

Introduction
Related Work
Cross-lingual Sentence Embedding
Token-level auxiliary tasks.
Word Alignment
Method
Acquisition of Word Alignment Supervision.
Aligned Word Prediction (AWP) Task
Word Translation Ranking (WTR) Task
Translation Ranking (TR) Task
Experimental Setup
Training Data
Low-resource Languages
Implementation Details
Model Size.
...and 16 more sections

Figures (2)

Figure 1: t-SNE visualization of sampled word embeddings from both high-resource and low-resource languages. Red points are word embeddings from high-resource languages; blue points are those from low-resource languages. The comparison highlights how word representations differ between models trained with and without explicit word-aligned training. Left: words in low-resource languages are under-aligned with their translations in high-resource languages. Right: the under-alignment is mitigated by the proposed explicit word-aligned training. Word sampling, word embeddings, and word-aligned training are described in the Implementation Details section.
Figure 2: Illustration of the WACSE framework. A parallel sentence pair is fed into the multilingual model and into a frozen word alignment model, yielding sentence representations and contextual token representations from the former and word alignments from the latter. Three objectives are then calculated: (1) translation ranking: aligning sentence-level semantics; (2) aligned word prediction: utilizing the contextual representations of masked words to predict their aligned counterparts in another language; and (3) word translation ranking: aligning word-level semantics.
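
The two ranking objectives in Figure 2 can be illustrated with one contrastive loss applied at two granularities. The sketch below is an assumption, not the paper's stated formulation: it uses a common in-batch softmax over cosine similarities, and the names `ranking_loss`, `gather_aligned`, and the `temperature` value are hypothetical. Sentence-level translation ranking passes sentence vectors; word translation ranking passes token vectors selected by the word alignment.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(src, tgt, temperature=0.05):
    """Mean cross-entropy of ranking tgt[i] first for src[i].

    Every other target in the batch acts as a negative. The temperature
    is an assumed hyperparameter, not a value taken from the paper.
    """
    total = 0.0
    for i, s in enumerate(src):
        logits = [cosine(s, t) / temperature for t in tgt]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(src)


def gather_aligned(src_tokens, tgt_tokens, alignment):
    """Select token vectors for each aligned (src_index, tgt_index) pair."""
    src = [src_tokens[i] for i, _ in alignment]
    tgt = [tgt_tokens[j] for _, j in alignment]
    return src, tgt


# Sentence level: row i of each list is one side of a translation pair.
sentence_loss = ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])

# Word level: the alignment (as a frozen aligner might output it)
# picks which token vectors form the positive pairs.
src_w, tgt_w = gather_aligned([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                              [[0.0, 1.0], [1.0, 0.0]],
                              [(0, 1), (1, 0)])
word_loss = ranking_loss(src_w, tgt_w)
```

Aligned word prediction is not covered here; per the caption it is a prediction task over masked words rather than a ranking task, and its exact form is not given in this summary.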
